Processing¶
This module contains code related to the Processor class, which is used for Amazon SageMaker Processing Jobs. These jobs let users perform data pre-processing, post-processing, feature engineering, data validation, and model evaluation and interpretation on Amazon SageMaker.
-
class sagemaker.processing.Processor(role, image_uri, instance_count, instance_type, entrypoint=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)¶
Bases: object
Handles Amazon SageMaker Processing tasks.
Initializes a Processor instance. The Processor handles Amazon SageMaker Processing tasks.
- Parameters
role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
image_uri (str) – The URI of the Docker image to use for the processing jobs.
instance_count (int) – The number of instances to run a processing job with.
instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
entrypoint (list[str]) – The entrypoint for the processing job (default: None). This is in the form of a list of strings that make a command.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – A KMS key for the processing volume (default: None).
output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.
base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
-
run(inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)¶
Runs a processing job.
- Parameters
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
- Raises
ValueError – if logs is True but wait is False.
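To make the relationship between entrypoint and arguments concrete, here is a minimal, hypothetical sketch (not the SDK's actual code): the command a processing container runs is effectively the entrypoint list followed by the arguments list passed to run().

```python
# Simplified illustration of how `entrypoint` and `arguments` combine
# into the command the processing container runs. The function name is
# hypothetical, not part of the SDK.
def build_container_command(entrypoint, arguments=None):
    """Return the effective command line for the processing container."""
    if entrypoint is None:
        # With no entrypoint, the image's own ENTRYPOINT/CMD is used.
        return None
    return list(entrypoint) + list(arguments or [])

print(build_container_command(["python3", "process.py"], ["--split", "0.2"]))
# ['python3', 'process.py', '--split', '0.2']
```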
-
class sagemaker.processing.ScriptProcessor(role, image_uri, command, instance_count, instance_type, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)¶
Bases: sagemaker.processing.Processor
Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.
Initializes a ScriptProcessor instance. The ScriptProcessor handles Amazon SageMaker Processing tasks for jobs using a machine learning framework, which allows for providing a script to be run as part of the Processing Job.
- Parameters
role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
image_uri (str) – The URI of the Docker image to use for the processing jobs.
command ([str]) – The command to run, along with any command-line flags. Example: [“python3”, “-v”].
instance_count (int) – The number of instances to run a processing job with.
instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – A KMS key for the processing volume (default: None).
output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.
base_job_name (str) – Prefix for the processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
-
get_run_args(code, inputs=None, outputs=None, arguments=None)¶
Returns a RunArgs object.
For processors (PySparkProcessor, SparkJarProcessor) that have special run() arguments, this object contains the normalized arguments for passing to ProcessingStep.
- Parameters
code (str) – This can be an S3 URI or a local path to a file with the framework script to run.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
-
run(code, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)¶
Runs a processing job.
- Parameters
code (str) – This can be an S3 URI or a local path to a file with the framework script to run.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
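The code parameter accepts either an S3 URI or a local path; a local path is uploaded to S3 before the job starts. A minimal sketch of how the two forms can be told apart (an illustration, not the SDK's internal logic):

```python
from urllib.parse import urlparse

# Hypothetical helper: distinguish an S3 URI from a local path, as the
# `code` parameter of ScriptProcessor.run() accepts both.
def is_s3_uri(code):
    """Return True if `code` looks like an S3 URI rather than a local path."""
    return urlparse(code).scheme == "s3"

print(is_s3_uri("s3://my-bucket/scripts/preprocess.py"))  # True
print(is_s3_uri("./preprocess.py"))                       # False
```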
-
class sagemaker.processing.ProcessingJob(sagemaker_session, job_name, inputs, outputs, output_kms_key=None)¶
Bases: sagemaker.job._Job
Provides functionality to start, describe, and stop processing jobs.
Initializes a Processing job.
- Parameters
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
job_name (str) – Name of the Processing job.
inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
output_kms_key (str) – The output KMS key associated with the job (default: None).
-
classmethod start_new(processor, inputs, outputs, experiment_config)¶
Starts a new processing job using the provided inputs and outputs.
- Parameters
processor (Processor) – The Processor instance that started the job.
inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
- Returns
The instance of ProcessingJob created using the Processor.
- Return type
ProcessingJob
-
classmethod from_processing_name(sagemaker_session, processing_job_name)¶
Initializes a ProcessingJob from a processing job name.
- Parameters
- Returns
The instance of ProcessingJob created from the job name.
- Return type
ProcessingJob
-
classmethod from_processing_arn(sagemaker_session, processing_job_arn)¶
Initializes a ProcessingJob from a Processing ARN.
- Parameters
- Returns
The instance of ProcessingJob created from the processing job’s ARN.
- Return type
ProcessingJob
-
wait(logs=True)¶
Waits for the processing job to complete.
- Parameters
logs (bool) – Whether to show the logs produced by the job (default: True).
-
describe()¶
Prints out a response from the DescribeProcessingJob API call.
-
stop()¶
Stops the processing job.
-
static prepare_app_specification(container_arguments, container_entrypoint, image_uri)¶
Prepares a dict that represents a ProcessingJob’s AppSpecification.
- Parameters
- Returns
Represents AppSpecification which configures the processing job to run a specified Docker container image.
- Return type
dict
-
static prepare_output_config(kms_key_id, outputs)¶
Prepares a dict that represents a ProcessingOutputConfig.
- Parameters
kms_key_id (str) – The AWS Key Management Service (AWS KMS) key that Amazon SageMaker uses to encrypt the processing job output. KmsKeyId can be an ID of a KMS key, the ARN of a KMS key, an alias of a KMS key, or the alias ARN of a KMS key. The KmsKeyId is applied to all outputs.
outputs (list[dict]) – Output configuration information for a processing job.
- Returns
Represents output configuration for the processing job.
- Return type
dict
-
static prepare_processing_resources(instance_count, instance_type, volume_kms_key_id, volume_size_in_gb)¶
Prepares a dict that represents the ProcessingResources.
- Parameters
instance_count (int) – The number of ML compute instances to use in the processing job. For distributed processing jobs, specify a value greater than 1. The default value is 1.
instance_type (str) – The ML compute instance type for the processing job.
volume_kms_key_id (str) – The AWS Key Management Service (AWS KMS) key that Amazon SageMaker uses to encrypt data on the storage volume attached to the ML compute instance(s) that run the processing job.
volume_size_in_gb (int) – The size of the ML storage volume in gigabytes that you want to provision. You must specify sufficient ML storage for your scenario.
- Returns
Represents ProcessingResources, which identifies the ML compute instances and ML storage volumes to deploy for a processing job.
- Return type
dict
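Taken together, the three static helpers each prepare one section of the CreateProcessingJob request. A hand-built sketch of two of those sections, with field names following the SageMaker API (the image URI and values here are made up, and the helpers' exact output may differ):

```python
# Hand-built sketch of request sections like those the prepare_* helpers
# assemble. Field names follow the SageMaker CreateProcessingJob API;
# the concrete values are invented for illustration.
app_specification = {
    "ImageUri": "123456789012.dkr.ecr.us-west-2.amazonaws.com/my-image:latest",
    "ContainerEntrypoint": ["python3", "process.py"],
    "ContainerArguments": ["--mode", "train"],
}
processing_resources = {
    "ClusterConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 30,
    }
}
print(sorted(app_specification))
# ['ContainerArguments', 'ContainerEntrypoint', 'ImageUri']
```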
-
class sagemaker.processing.ProcessingInput(source=None, destination=None, input_name=None, s3_data_type='S3Prefix', s3_input_mode='File', s3_data_distribution_type='FullyReplicated', s3_compression_type='None', s3_input=None, dataset_definition=None, app_managed=False)¶
Bases: object
Accepts parameters that specify an Amazon S3 input for a processing job.
Also provides a method to turn those parameters into a dictionary.
Initializes a ProcessingInput instance. ProcessingInput accepts parameters that specify an Amazon S3 input for a processing job and provides a method to turn those parameters into a dictionary.
- Parameters
source (str) – The source for the input. If a local path is provided, it will automatically be uploaded to S3 under: “s3://<default-bucket-name>/<job-name>/input/<input-name>”.
destination (str) – The destination of the input.
input_name (str) – The name for the input. If a name is not provided, one will be generated (e.g. “input-1”).
s3_data_type (str) – Valid options are “ManifestFile” or “S3Prefix”.
s3_input_mode (str) – Valid options are “Pipe” or “File”.
s3_data_distribution_type (str) – Valid options are “FullyReplicated” or “ShardedByS3Key”.
s3_compression_type (str) – Valid options are “None” or “Gzip”.
s3_input (S3Input) – Metadata of data objects stored in S3.
dataset_definition (DatasetDefinition) – Configuration for a DatasetDefinition input.
app_managed (bool) – Whether the input is managed by SageMaker or by the application.
-
class sagemaker.processing.ProcessingOutput(source=None, destination=None, output_name=None, s3_upload_mode='EndOfJob', app_managed=False, feature_store_output=None)¶
Bases: object
Accepts parameters that specify an Amazon S3 output for a processing job.
It also provides a method to turn those parameters into a dictionary.
Initializes a ProcessingOutput instance. ProcessingOutput accepts parameters that specify an Amazon S3 output for a processing job and provides a method to turn those parameters into a dictionary.
- Parameters
source (str) – The source for the output.
destination (str) – The destination of the output. If a destination is not provided, one will be generated: “s3://<default-bucket-name>/<job-name>/output/<output-name>”.
output_name (str) – The name of the output. If a name is not provided, one will be generated (e.g. “output-1”).
s3_upload_mode (str) – Valid options are “EndOfJob” or “Continuous”.
app_managed (bool) – Whether the output is managed by SageMaker or by the application.
feature_store_output (FeatureStoreOutput) – Configuration for processing job outputs in Amazon SageMaker Feature Store.
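Both classes exist to turn constructor parameters into request dictionaries. A hand-written sketch of the shapes they map to, with field names following the SageMaker CreateProcessingJob API (bucket names and paths are invented, and the classes' exact output may differ):

```python
# Sketch of the request structures a ProcessingInput and ProcessingOutput
# correspond to. Field names follow the SageMaker CreateProcessingJob API;
# the S3 URIs and local paths are made up for illustration.
processing_input = {
    "InputName": "input-1",
    "S3Input": {
        "S3Uri": "s3://my-bucket/my-job/input/input-1",
        "LocalPath": "/opt/ml/processing/input",
        "S3DataType": "S3Prefix",
        "S3InputMode": "File",
        "S3DataDistributionType": "FullyReplicated",
        "S3CompressionType": "None",
    },
}
processing_output = {
    "OutputName": "output-1",
    "S3Output": {
        "S3Uri": "s3://my-bucket/my-job/output/output-1",
        "LocalPath": "/opt/ml/processing/output",
        "S3UploadMode": "EndOfJob",
    },
}
print(processing_input["S3Input"]["S3DataType"])
# S3Prefix
```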
-
class sagemaker.processing.RunArgs(code, inputs=None, outputs=None, arguments=None)¶
Bases: object
Accepts parameters that correspond to ScriptProcessors.
An instance of this class is returned from the get_run_args() method on processors, and is used for normalizing the arguments so that they can be passed to ProcessingStep.
- Parameters
code (str) – This can be an S3 URI or a local path to a file with the framework script to run.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
Method generated by attrs for class RunArgs.
-
class sagemaker.processing.FeatureStoreOutput(**kwargs)¶
Bases: sagemaker.apiutils._base_types.ApiObject
Configuration for processing job outputs in Amazon SageMaker Feature Store.
Init ApiObject.
-
feature_group_name = None¶
-
This module is the entry point for running Spark processing scripts.
This module contains code related to Spark Processors, which are used for Processing jobs. These jobs let customers perform data pre-processing, post-processing, feature engineering, data validation, and model evaluation on SageMaker using Spark and PySpark.
-
class sagemaker.spark.processing.PySparkProcessor(role, instance_type, instance_count, framework_version=None, py_version=None, container_version=None, image_uri=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)¶
Bases: sagemaker.spark.processing._SparkProcessorBase
Handles Amazon SageMaker processing tasks for jobs using PySpark.
Initializes a PySparkProcessor instance. The PySparkProcessor handles Amazon SageMaker processing tasks for jobs using SageMaker PySpark.
- Parameters
framework_version (str) – The version of SageMaker PySpark.
py_version (str) – The version of Python.
container_version (str) – The version of the Spark container.
role (str) – An AWS IAM role name or ARN. The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.
instance_type (str) – Type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
instance_count (int) – The number of instances to run the Processing job with. Defaults to 1.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – A KMS key for the processing volume.
output_kms_key (str) – The KMS key id for all ProcessingOutputs.
max_runtime_in_seconds (int) – Timeout in seconds. After this amount of time Amazon SageMaker terminates the job regardless of its current status.
base_job_name (str) – Prefix for the processing job name. If not specified, the processor generates a default job name, based on the training image name and current timestamp.
sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict) – Environment variables to be passed to the processing job.
tags ([dict]) – List of tags to be passed to the processing job.
network_config (sagemaker.network.NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
-
get_run_args(submit_app, submit_py_files=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, job_name=None, configuration=None, spark_event_logs_s3_uri=None)¶
Returns a RunArgs object.
This object contains the normalized inputs, outputs, and arguments needed when using a PySparkProcessor in a ProcessingStep.
- Parameters
submit_app (str) – Path (local or S3) to Python file to submit to Spark as the primary application. This is translated to the code property on the returned RunArgs object.
submit_py_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --py-files option.
submit_jars (list[str]) – List of paths (local or S3) to provide for the spark-submit --jars option.
submit_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --files option.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
spark_event_logs_s3_uri (str) – S3 path where Spark application events will be published.
-
run(submit_app, submit_py_files=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, configuration=None, spark_event_logs_s3_uri=None, kms_key=None)¶
Runs a processing job.
- Parameters
submit_app (str) – Path (local or S3) to Python file to submit to Spark as the primary application.
submit_py_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --py-files option.
submit_jars (list[str]) – List of paths (local or S3) to provide for the spark-submit --jars option.
submit_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --files option.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
spark_event_logs_s3_uri (str) – S3 path where Spark application events will be published.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
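The submit_py_files, submit_jars, and submit_files lists correspond to spark-submit options, which accept comma-separated paths. An illustrative sketch of that mapping (not the SDK's actual implementation; the function name is hypothetical):

```python
# Illustrative sketch: how the submit_* lists correspond to spark-submit
# options. spark-submit takes each of these flags as a comma-separated
# list of paths.
def spark_submit_options(submit_py_files=None, submit_jars=None, submit_files=None):
    opts = []
    for flag, paths in (
        ("--py-files", submit_py_files),
        ("--jars", submit_jars),
        ("--files", submit_files),
    ):
        if paths:
            opts += [flag, ",".join(paths)]
    return opts

print(spark_submit_options(submit_py_files=["deps.zip"], submit_jars=["s3://bucket/lib.jar"]))
# ['--py-files', 'deps.zip', '--jars', 's3://bucket/lib.jar']
```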
-
class sagemaker.spark.processing.SparkJarProcessor(role, instance_type, instance_count, framework_version=None, py_version=None, container_version=None, image_uri=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)¶
Bases: sagemaker.spark.processing._SparkProcessorBase
Handles Amazon SageMaker processing tasks for jobs using Spark with Java or Scala Jars.
Initializes a SparkJarProcessor instance. The SparkJarProcessor handles Amazon SageMaker processing tasks for jobs using SageMaker Spark.
- Parameters
framework_version (str) – The version of SageMaker PySpark.
py_version (str) – The version of Python.
container_version (str) – The version of the Spark container.
role (str) – An AWS IAM role name or ARN. The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.
instance_type (str) – Type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
instance_count (int) – The number of instances to run the Processing job with. Defaults to 1.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – A KMS key for the processing volume.
output_kms_key (str) – The KMS key id for all ProcessingOutputs.
max_runtime_in_seconds (int) – Timeout in seconds. After this amount of time Amazon SageMaker terminates the job regardless of its current status.
base_job_name (str) – Prefix for the processing job name. If not specified, the processor generates a default job name, based on the training image name and current timestamp.
sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict) – Environment variables to be passed to the processing job.
tags ([dict]) – List of tags to be passed to the processing job.
network_config (sagemaker.network.NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
-
get_run_args(submit_app, submit_class=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, job_name=None, configuration=None, spark_event_logs_s3_uri=None)¶
Returns a RunArgs object.
This object contains the normalized inputs, outputs, and arguments needed when using a SparkJarProcessor in a ProcessingStep.
- Parameters
submit_app (str) – Path (local or S3) to Jar file to submit to Spark as the primary application. This is translated to the code property on the returned RunArgs object.
submit_class (str) – Java class reference to submit to Spark as the primary application.
submit_jars (list[str]) – List of paths (local or S3) to provide for the spark-submit --jars option.
submit_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --files option.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
spark_event_logs_s3_uri (str) – S3 path where Spark application events will be published.
-
run(submit_app, submit_class=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, configuration=None, spark_event_logs_s3_uri=None, kms_key=None)¶
Runs a processing job.
- Parameters
submit_app (str) – Path (local or S3) to Jar file to submit to Spark as the primary application.
submit_class (str) – Java class reference to submit to Spark as the primary application.
submit_jars (list[str]) – List of paths (local or S3) to provide for the spark-submit --jars option.
submit_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --files option.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
spark_event_logs_s3_uri (str) – S3 path where Spark application events will be published.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
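For a jar-based job, submit_class names the entry point via spark-submit's --class option. A simplified sketch of the resulting command shape (illustrative only, not the SDK's actual code):

```python
# Illustrative sketch: a SparkJarProcessor-style submission adds the
# primary jar plus a --class option naming the entry point. The function
# name is hypothetical.
def spark_jar_command(submit_app, submit_class=None):
    cmd = ["spark-submit"]
    if submit_class:
        cmd += ["--class", submit_class]
    cmd.append(submit_app)
    return cmd

print(spark_jar_command("s3://my-bucket/my-app.jar", submit_class="com.example.Main"))
# ['spark-submit', '--class', 'com.example.Main', 's3://my-bucket/my-app.jar']
```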
-
class sagemaker.spark.processing.FileType(value)¶
Bases: enum.Enum
Enum of file type.
-
JAR = 1¶
-
PYTHON = 2¶
-
FILE = 3¶
-
This module configures the SageMaker Clarify bias and model explainability processor job.
-
class sagemaker.clarify.DataConfig(s3_data_input_path, s3_output_path, label=None, headers=None, features=None, dataset_type='text/csv', s3_data_distribution_type='FullyReplicated', s3_compression_type='None')¶
Bases: object
Config object related to configurations of the input and output dataset.
Initializes a configuration of both input and output datasets.
- Parameters
s3_data_input_path (str) – Dataset S3 prefix/object URI.
s3_output_path (str) – S3 prefix to store the output.
label (str) – Target attribute of the model required by bias metrics (optional for SHAP). Specified as a column name or index for CSV datasets, or as a JSONPath for JSONLines.
headers (list[str]) – A list of column names in the input dataset.
features (str) – JSONPath for locating the feature columns for bias metrics if the dataset format is JSONLines.
dataset_type (str) – Format of the dataset. Valid values are “text/csv” for CSV and “application/jsonlines” for JSONLines.
s3_data_distribution_type (str) – Valid options are “FullyReplicated” or “ShardedByS3Key”.
s3_compression_type (str) – Valid options are “None” or “Gzip”.
-
get_config()¶
Returns part of an analysis config dictionary.
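get_config() returns the dataset-related fragment of the Clarify analysis configuration. A hand-written sketch of what such a fragment might look like (the key names are assumptions inferred from the parameters above and are not guaranteed to match the SDK's exact output):

```python
# Hand-written sketch of a dataset fragment of a Clarify analysis config.
# Key names and values are assumptions based on the DataConfig parameters
# above, not the SDK's exact output; the column names are invented.
data_config_fragment = {
    "dataset_type": "text/csv",
    "headers": ["age", "income", "approved"],
    "label": "approved",
}
print(data_config_fragment["label"])
# approved
```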
-
class sagemaker.clarify.BiasConfig(label_values_or_threshold, facet_name, facet_values_or_threshold=None, group_name=None)¶
Bases: object
Config object related to bias configurations of the input dataset.
Initializes a configuration of the sensitive groups in the dataset.
- Parameters
label_values_or_threshold (Any) – List of label values or threshold to indicate positive outcome used for bias metrics.
facet_name (str) – Sensitive attribute in the input data for which we would like to compare metrics.
facet_values_or_threshold (list) – Optional list of values to form a sensitive group, or a threshold for a numeric facet column that defines the lower bound of a sensitive group. Defaults to considering each possible value as a sensitive group and computing metrics versus all the other examples.
group_name (str) – Optional column name or index to indicate a group column to be used for the bias metric ‘Conditional Demographic Disparity in Labels - CDDL’ or ‘Conditional Demographic Disparity in Predicted Labels - CDDPL’.
-
get_config()¶
Returns part of an analysis config dictionary.
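Analogously, BiasConfig.get_config() contributes the bias-related fragment of the analysis configuration. A hand-written sketch (key names are assumptions based on the parameters above, not the SDK's exact output):

```python
# Hand-written sketch of a bias fragment of a Clarify analysis config.
# Key names are assumptions based on the BiasConfig parameters above;
# the facet and group names are invented.
bias_config_fragment = {
    "label_values_or_threshold": [1],
    "facet": [{"name_or_index": "gender", "value_or_threshold": ["female"]}],
    "group_variable": "region",
}
print(bias_config_fragment["facet"][0]["name_or_index"])
# gender
```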
-
class sagemaker.clarify.ModelConfig(model_name, instance_count, instance_type, accept_type=None, content_type=None, content_template=None, custom_attributes=None, accelerator_type=None, endpoint_name_prefix=None)¶
Bases: object
Config object related to a model and its endpoint to be created.
Initializes a configuration of a model and the endpoint to be created for it.
- Parameters
model_name (str) – Model name (as created by ‘CreateModel’).
instance_count (int) – The number of instances of a new endpoint for model inference.
instance_type (str) – The type of EC2 instance to use for model inference, for example, ‘ml.c5.xlarge’.
accept_type (str) – The model output format to be used for getting inferences with the shadow endpoint. Valid values are “text/csv” for CSV and “application/jsonlines”. Default is the same as content_type.
content_type (str) – The model input format to be used for getting inferences with the shadow endpoint. Valid values are “text/csv” for CSV and “application/jsonlines”. Default is the same as dataset format.
content_template (str) – A template string used to construct the model input from dataset instances. It is only used when “model_content_type” is “application/jsonlines”. The template should have exactly one placeholder, $features, which will be replaced by the features list to form the model inference input.
custom_attributes (str) – Provides additional information about a request for an inference submitted to a model hosted at an Amazon SageMaker endpoint. The information is an opaque value that is forwarded verbatim. You could use this value, for example, to provide an ID that you can use to track a request or to provide other metadata that a service endpoint was programmed to process. The value must consist of no more than 1024 visible US-ASCII characters as specified in Section 3.2.6. Field Value Components ( https://tools.ietf.org/html/rfc7230#section-3.2.6) of the Hypertext Transfer Protocol (HTTP/1.1).
accelerator_type (str) – The Elastic Inference accelerator type to deploy to the model endpoint instance for making inferences to the model, see https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html.
endpoint_name_prefix (str) – The endpoint name prefix of a new endpoint. Must follow pattern “^[a-zA-Z0-9](-*[a-zA-Z0-9])*”.
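Putting the parameters above together, a ModelConfig for a JSON Lines model might be constructed as follows. This is a minimal sketch; the model name, instance settings, and template are illustrative placeholders, not values from this documentation.

```python
from sagemaker.clarify import ModelConfig

# All values below are placeholders; substitute your own model name
# (as created by 'CreateModel') and instance settings.
model_config = ModelConfig(
    model_name="my-model",                       # hypothetical model name
    instance_count=1,
    instance_type="ml.c5.xlarge",
    accept_type="application/jsonlines",         # model output format
    content_type="application/jsonlines",        # model input format
    content_template='{"features":$features}',   # $features is the one required placeholder
)
```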
-
get_predictor_config
()¶ Returns part of the predictor dictionary of the analysis config.
-
class
sagemaker.clarify.
ModelPredictedLabelConfig
(label=None, probability=None, probability_threshold=None, label_headers=None)¶ Bases:
object
Config object to extract a predicted label from the model output.
Initializes a model output config to extract the predicted label.
- The following examples show different parameter configurations depending on the endpoint:
- Regression task: The model returns the score, e.g. 1.2. We don’t need to specify anything. For JSON output, e.g. {‘score’: 1.2}, we can set label=’score’.
- Binary classification:
- The model returns a single probability and we would like to classify as ‘yes’ those with a probability exceeding 0.2. We can set probability_threshold=0.2 and label_headers=’yes’.
- The model returns {‘probability’: 0.3}, to which we would like to apply a threshold of 0.5 to obtain a predicted label in {0, 1}. In this case we can set label=’probability’.
- The model returns a tuple of the predicted label and the probability. In this case we can set label=0.
- Multiclass classification:
- The model returns {‘labels’: [‘cat’, ‘dog’, ‘fish’], ‘probabilities’: [0.35, 0.25, 0.4]}. In this case we would set probability=’probabilities’ and label=’labels’ and infer the predicted label to be ‘fish’.
- The model returns {‘predicted_label’: ‘fish’, ‘probabilities’: [0.35, 0.25, 0.4]}. In this case we would set label=’predicted_label’.
- The model returns [0.35, 0.25, 0.4]. In this case, we can set label_headers=[‘cat’, ’dog’, ’fish’] and infer the predicted label to be ‘fish’.
- Parameters
label (str or int or list[int]) – Optional index or JSONPath location in the model output for the prediction. If this is a predicted label of the same type as the label in the dataset, no further arguments need to be specified.
probability (str or int or list[int]) – Optional index or JSONPath location in the model output for the predicted scores.
probability_threshold (float) – An optional value for binary prediction tasks in which the model returns a probability, to indicate the threshold to convert the prediction to a boolean value. Default is 0.5.
label_headers (list) – List of label values, one for each score of the probability.
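The multiclass example above, where the model returns both a list of labels and a list of probabilities, can be expressed as a config object. A minimal sketch; the JSON key names are those from the example:

```python
from sagemaker.clarify import ModelPredictedLabelConfig

# The model returns {'labels': [...], 'probabilities': [...]}, so point
# 'label' and 'probability' at those JSON keys; Clarify then infers the
# predicted label as the one with the highest score.
predicted_label_config = ModelPredictedLabelConfig(
    label="labels",
    probability="probabilities",
)
```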
-
get_predictor_config
()¶ Returns probability_threshold, predictor config.
-
class
sagemaker.clarify.
ExplainabilityConfig
¶ Bases:
abc.ABC
Abstract config class to configure an explainability method.
-
abstract
get_explainability_config
()¶ Returns config.
-
class
sagemaker.clarify.
SHAPConfig
(baseline, num_samples, agg_method, use_logit=False, save_local_shap_values=True, seed=None)¶ Bases:
sagemaker.clarify.ExplainabilityConfig
Config class of SHAP.
Initializes config for SHAP.
- Parameters
baseline (str or list) – A list of rows (at least one) or S3 object URI to be used as the baseline dataset in the Kernel SHAP algorithm. The format should be the same as the dataset format. Each row should contain only the feature columns/values and omit the label column/values.
num_samples (int) – Number of samples to be used in the Kernel SHAP algorithm. This number determines the size of the generated synthetic dataset to compute the SHAP values.
agg_method (str) – Aggregation method for global SHAP values. Valid values are “mean_abs” (mean of absolute SHAP values for all instances), “median” (median of SHAP values for all instances) and “mean_sq” (mean of squared SHAP values for all instances).
use_logit (bool) – Indicator of whether the logit function is to be applied to the model predictions. Default is False. If “use_logit” is true then the SHAP values will have log-odds units.
save_local_shap_values (bool) – Indicator of whether to save the local SHAP values in the output location. Default is True.
seed (int) – Seed value used to obtain deterministic SHAP values. Default is None.
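A SHAPConfig built from the parameters above might look like the following sketch. The baseline row is an illustrative placeholder; it must match the dataset format and contain only feature columns (no label):

```python
from sagemaker.clarify import SHAPConfig

shap_config = SHAPConfig(
    baseline=[[0.5, 1.0, 0.0]],   # one baseline row (an S3 object URI also works)
    num_samples=100,              # size of the generated synthetic dataset
    agg_method="mean_abs",        # aggregate global SHAP values by mean |value|
    save_local_shap_values=True,  # also write per-example SHAP values to the output
)
```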
-
get_explainability_config
()¶ Returns config.
-
class
sagemaker.clarify.
SageMakerClarifyProcessor
(role, instance_count, instance_type, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, sagemaker_session=None, env=None, tags=None, network_config=None, version=None)¶ Bases:
sagemaker.processing.Processor
Handles SageMaker Processing task to compute bias metrics and explain a model.
Initializes a
Processor
instance, computing bias metrics and model explanations.
- Parameters
role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
instance_count (int) – The number of instances to run a processing job with.
instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – A KMS key for the processing volume (default: None).
output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.
sagemaker_session (
Session
) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
network_config (
NetworkConfig
) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
version (str) – The Clarify version to be used.
-
run
(**_)¶ Overrides the base class method, deferring to the specific run_* methods instead.
-
run_pre_training_bias
(data_config, data_bias_config, methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)¶ Runs a ProcessingJob to compute the requested bias ‘methods’ of the input data.
Computes the requested bias ‘methods’ (e.g. the fraction of examples) that compare the sensitive group vs. the other examples.
- Parameters
data_config (
DataConfig
) – Config of the input/output data.
data_bias_config (BiasConfig) – Config of sensitive groups.
methods (str or list[str]) – Selector of a subset of potential metrics: [“CI”, “DPL”, “KL”, “JS”, “LP”, “TVD”, “KS”, “CDDL”]. Defaults to computing all.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when
wait
is True (default: True).
job_name (str) – Processing job name. If not specified, a name is composed of “Clarify-Pretraining-Bias” and current timestamp.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
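A pre-training bias run only needs the data and bias configs, since no model endpoint is involved. A minimal sketch; bucket, column names, and the role ARN are illustrative placeholders:

```python
from sagemaker.clarify import BiasConfig, DataConfig, SageMakerClarifyProcessor

# Placeholders throughout: bucket, column names, role ARN.
data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/input/train.csv",
    s3_output_path="s3://my-bucket/clarify-output",
    label="target",
    headers=["target", "age", "income"],
    dataset_type="text/csv",
)
bias_config = BiasConfig(
    label_values_or_threshold=[1],  # favorable label value(s)
    facet_name="age",               # sensitive attribute to analyze
)

clarify_processor = SageMakerClarifyProcessor(
    role="arn:aws:iam::111122223333:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.c4.xlarge",
)
# Compute only two of the pre-training metrics instead of 'all'.
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods=["CI", "DPL"],
    wait=True,
)
```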
-
run_post_training_bias
(data_config, data_bias_config, model_config, model_predicted_label_config, methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)¶ Runs a ProcessingJob to compute the requested bias ‘methods’ of the model predictions.
Spins up a model endpoint and runs inference over the input examples in the ‘s3_data_input_path’ to obtain predicted labels. Computes the requested ‘methods’ (e.g. accuracy, precision, recall) that compare the sensitive group vs. the other examples.
- Parameters
data_config (
DataConfig
) – Config of the input/output data.
data_bias_config (BiasConfig) – Config of sensitive groups.
model_config (ModelConfig) – Config of the model and its endpoint to be created.
model_predicted_label_config (ModelPredictedLabelConfig) – Config of how to extract the predicted label from the model output.
methods (str or list[str]) – Selector of a subset of potential metrics: [“DPPL”, “DI”, “DCA”, “DCR”, “RD”, “DAR”, “DRR”, “AD”, “CDDPL”, “TE”, “FT”]. Defaults to computing all.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when
wait
is True (default: True).
job_name (str) – Processing job name. If not specified, a name is composed of “Clarify-Posttraining-Bias” and current timestamp.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
-
run_bias
(data_config, bias_config, model_config, model_predicted_label_config=None, pre_training_methods='all', post_training_methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)¶ Runs a ProcessingJob to compute the requested bias ‘methods’ of the model predictions.
Spins up a model endpoint and runs inference over the input examples in the ‘s3_data_input_path’ to obtain predicted labels. Computes the requested ‘methods’ (e.g. accuracy, precision, recall) that compare the sensitive group vs. the other examples.
- Parameters
data_config (
DataConfig
) – Config of the input/output data.
bias_config (BiasConfig) – Config of sensitive groups.
model_config (ModelConfig) – Config of the model and its endpoint to be created.
model_predicted_label_config (ModelPredictedLabelConfig) – Config of how to extract the predicted label from the model output.
pre_training_methods (str or list[str]) – Selector of a subset of potential metrics: [“CI”, “DPL”, “KL”, “JS”, “LP”, “TVD”, “KS”, “CDDL”]. Defaults to computing all.
post_training_methods (str or list[str]) – Selector of a subset of potential metrics: [“DPPL”, “DI”, “DCA”, “DCR”, “RD”, “DAR”, “DRR”, “AD”, “CDDPL”, “TE”, “FT”]. Defaults to computing all.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when
wait
is True (default: True).
job_name (str) – Processing job name. If not specified, a name is composed of “Clarify-Bias” and current timestamp.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
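run_bias combines both metric families in one job: pre-training metrics come from the data alone, while post-training metrics require the shadow endpoint that the job creates and tears down itself. A minimal sketch; bucket, model name, and role ARN are illustrative placeholders:

```python
from sagemaker.clarify import (
    BiasConfig, DataConfig, ModelConfig, SageMakerClarifyProcessor,
)

# Placeholders throughout; see the parameter descriptions above.
data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/input/train.csv",
    s3_output_path="s3://my-bucket/bias-output",
    label="target",
    dataset_type="text/csv",
)
bias_config = BiasConfig(label_values_or_threshold=[1], facet_name="age")
model_config = ModelConfig(
    model_name="my-model",   # as created by 'CreateModel'
    instance_count=1,
    instance_type="ml.c5.xlarge",
)

processor = SageMakerClarifyProcessor(
    role="arn:aws:iam::111122223333:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.c4.xlarge",
)
# One job computes both pre- and post-training bias metrics.
processor.run_bias(
    data_config=data_config,
    bias_config=bias_config,
    model_config=model_config,
    pre_training_methods="all",
    post_training_methods="all",
)
```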
-
run_explainability
(data_config, model_config, explainability_config, model_scores=None, wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)¶ Runs a ProcessingJob that computes feature importance for each example in the input.
Currently, only SHAP is supported as the explainability method.
Spins up a model endpoint. For each input example in the ‘s3_data_input_path’, the SHAP algorithm determines feature importance by creating ‘num_samples’ copies of the example with a subset of features replaced with values from the ‘baseline’. Model inference is run to see how the prediction changes with the replaced features. If the model output returns multiple scores, importance is computed for each of them. Across examples, feature importance is aggregated using ‘agg_method’.
- Parameters
data_config (
DataConfig
) – Config of the input/output data.
model_config (ModelConfig) – Config of the model and its endpoint to be created.
explainability_config (ExplainabilityConfig) – Config of the specific explainability method. Currently, only SHAP is supported.
model_scores – Index or JSONPath location in the model output for the predicted scores to be explained. This is not required if the model output is a single score.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when
wait
is True (default: True).
job_name (str) – Processing job name. If not specified, a name is composed of “Clarify-Explainability” and current timestamp.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
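An explainability run wires a SHAPConfig into the processor alongside the data and model configs. A minimal sketch; bucket, model name, and baseline row are illustrative placeholders:

```python
from sagemaker.clarify import (
    DataConfig, ModelConfig, SHAPConfig, SageMakerClarifyProcessor,
)

# Placeholders throughout (bucket, model name, baseline row).
data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/input/train.csv",
    s3_output_path="s3://my-bucket/explainability-output",
    label="target",
    dataset_type="text/csv",
)
model_config = ModelConfig(
    model_name="my-model", instance_count=1, instance_type="ml.c5.xlarge",
)
shap_config = SHAPConfig(
    baseline=[[0.5, 1.0, 0.0]],  # must match the dataset's feature columns
    num_samples=50,              # synthetic copies per example
    agg_method="mean_abs",
)

processor = SageMakerClarifyProcessor(
    role="arn:aws:iam::111122223333:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.c4.xlarge",
)
processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```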