Processing
This module contains code related to the Processor class, which is used for Amazon SageMaker Processing Jobs. These jobs let users perform data pre-processing, post-processing, feature engineering, data validation, model evaluation, and interpretation on Amazon SageMaker.
class sagemaker.processing.Processor(role, image_uri, instance_count, instance_type, entrypoint=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)
Bases: object
Handles Amazon SageMaker Processing tasks.
Initializes a Processor instance. The Processor handles Amazon SageMaker Processing tasks.
- Parameters
role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
image_uri (str) – The URI of the Docker image to use for the processing jobs.
instance_count (int) – The number of instances to run a processing job with.
instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
entrypoint (list[str]) – The entrypoint for the processing job (default: None). This is in the form of a list of strings that make a command.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – A KMS key for the processing volume (default: None).
output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.
base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
run(inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)
Runs a processing job.
- Parameters
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
- Raises
ValueError – if logs is True but wait is False.
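The constructor and run() parameters above can be combined roughly as follows. This is a minimal sketch, not a reference implementation: the helper names (processor_kwargs, check_wait_logs, launch), the container paths, and all argument values are illustrative assumptions. check_wait_logs only mirrors the documented ValueError rule; it is not the SDK's own code. The SDK import is deferred into launch() so that the argument-building helpers can be inspected without the SDK or AWS credentials.

```python
def processor_kwargs(role, image_uri):
    """Collect constructor arguments for Processor (all values are placeholders)."""
    return {
        "role": role,
        "image_uri": image_uri,
        "instance_count": 1,
        "instance_type": "ml.c4.xlarge",
        # Hypothetical script location inside the container image.
        "entrypoint": ["python3", "/opt/program/validate.py"],
        "volume_size_in_gb": 30,
        "max_runtime_in_seconds": 3600,
        "env": {"STAGE": "dev"},
    }


def check_wait_logs(wait, logs):
    """Mirror the documented rule: logs=True is only valid when wait=True."""
    if logs and not wait:
        raise ValueError("logs=True requires wait=True")


def launch(role, image_uri, input_s3_uri, wait=True, logs=True):
    """Create a Processor and start one job (needs the SDK and AWS credentials)."""
    check_wait_logs(wait, logs)
    # Imported here so the helpers above work without the SageMaker SDK installed.
    from sagemaker.processing import ProcessingInput, ProcessingOutput, Processor

    processor = Processor(**processor_kwargs(role, image_uri))
    processor.run(
        # Assumed container paths; adjust to whatever the image expects.
        inputs=[ProcessingInput(source=input_s3_uri, destination="/opt/ml/processing/input")],
        outputs=[ProcessingOutput(source="/opt/ml/processing/output")],
        arguments=["--mode", "validate"],
        wait=wait,
        logs=logs,
    )
    return processor
```

Calling launch(..., wait=False, logs=False) would return as soon as the job is submitted, which is the combination to use when the caller does not want to block.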
class sagemaker.processing.ScriptProcessor(role, image_uri, command, instance_count, instance_type, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)
Bases: sagemaker.processing.Processor
Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.
Initializes a ScriptProcessor instance. The ScriptProcessor handles Amazon SageMaker Processing tasks for jobs using a machine learning framework, which allows for providing a script to be run as part of the Processing Job.
- Parameters
role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
image_uri (str) – The URI of the Docker image to use for the processing jobs.
command (list[str]) – The command to run, along with any command-line flags. Example: ["python3", "-v"].
instance_count (int) – The number of instances to run a processing job with.
instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – A KMS key for the processing volume (default: None).
output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.
base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
run(code, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)
Runs a processing job.
- Parameters
code (str) – This can be an S3 URI or a local path to a file with the framework script to run.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
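A short sketch of how ScriptProcessor differs from Processor in practice: the constructor takes a command instead of an entrypoint, and run() takes the script through code. The helper names, the base job name, and all values below are illustrative assumptions; run_script() needs the SDK and AWS credentials, while script_processor_kwargs() is plain Python.

```python
def script_processor_kwargs(role, image_uri):
    """Constructor arguments for ScriptProcessor (placeholder values)."""
    return {
        "role": role,
        "image_uri": image_uri,
        # The command that will execute the script supplied to run(code=...).
        "command": ["python3"],
        "instance_count": 1,
        "instance_type": "ml.c4.xlarge",
        "base_job_name": "preprocess",  # hypothetical job-name prefix
    }


def run_script(role, image_uri, code, arguments=None):
    """Submit a script as a processing job without blocking on completion."""
    # Deferred import: only needed when a job is actually submitted.
    from sagemaker.processing import ScriptProcessor

    processor = ScriptProcessor(**script_processor_kwargs(role, image_uri))
    # code may be a local file path or an S3 URI, as documented above.
    processor.run(code=code, arguments=arguments or [], wait=False, logs=False)
    return processor
```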
class sagemaker.processing.ProcessingJob(sagemaker_session, job_name, inputs, outputs, output_kms_key=None)
Bases: sagemaker.job._Job
Provides functionality to start, describe, and stop processing jobs.
Initializes a Processing job.
- Parameters
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
job_name (str) – Name of the Processing job.
inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
output_kms_key (str) – The output KMS key associated with the job (default: None).
classmethod start_new(processor, inputs, outputs, experiment_config)
Starts a new processing job using the provided inputs and outputs.
- Parameters
processor (Processor) – The Processor instance that started the job.
inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
- Returns
The instance of ProcessingJob created using the Processor.
- Return type
ProcessingJob
classmethod from_processing_name(sagemaker_session, processing_job_name)
Initializes a ProcessingJob from a processing job name.
- Parameters
- Returns
The instance of ProcessingJob created from the job name.
- Return type
ProcessingJob
classmethod from_processing_arn(sagemaker_session, processing_job_arn)
Initializes a ProcessingJob from a processing job ARN.
- Parameters
- Returns
The instance of ProcessingJob created from the processing job’s ARN.
- Return type
ProcessingJob
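The two class methods above attach to an existing job by name or by ARN. The sketch below shows one way to choose between them; attach() and job_name_from_arn() are hypothetical helpers, and the ARN layout (a resource of the form processing-job/<name>) is the general SageMaker ARN convention rather than something this page specifies. The SDK may derive the name differently.

```python
def job_name_from_arn(processing_job_arn):
    """Return the job name embedded in a processing job ARN.

    Assumes the layout arn:<partition>:sagemaker:<region>:<account>:processing-job/<name>.
    """
    # An ARN has six colon-separated fields; the last one is the resource.
    resource = processing_job_arn.split(":", 5)[-1]
    resource_type, _, name = resource.partition("/")
    if resource_type != "processing-job" or not name:
        raise ValueError(f"not a processing job ARN: {processing_job_arn!r}")
    return name


def attach(sagemaker_session, identifier):
    """Attach to an existing job given either its name or its ARN."""
    # Deferred import: requires the SageMaker SDK and AWS credentials.
    from sagemaker.processing import ProcessingJob

    if identifier.startswith("arn:"):
        return ProcessingJob.from_processing_arn(sagemaker_session, identifier)
    return ProcessingJob.from_processing_name(sagemaker_session, identifier)
```

The object returned by attach() exposes the wait, describe, and stop methods of ProcessingJob.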
wait(logs=True)
Waits for the processing job to complete.
- Parameters
logs (bool) – Whether to show the logs produced by the job (default: True).
describe()
Prints out a response from the DescribeProcessingJob API call.
stop()
Stops the processing job.
static prepare_app_specification(container_arguments, container_entrypoint, image_uri)
Prepares a dict that represents a ProcessingJob’s AppSpecification.
- Parameters
- Returns
Represents AppSpecification which configures the processing job to run a specified Docker container image.
- Return type
dict
static prepare_output_config(kms_key_id, outputs)
Prepares a dict that represents a ProcessingOutputConfig.
- Parameters
kms_key_id (str) – The AWS Key Management Service (AWS KMS) key that Amazon SageMaker uses to encrypt the processing job output. KmsKeyId can be an ID of a KMS key, ARN of a KMS key, or alias of a KMS key. The KmsKeyId is applied to all outputs.
outputs (list[dict]) – Output configuration information for a processing job.
- Returns
Represents output configuration for the processing job.
- Return type
dict
static prepare_processing_resources(instance_count, instance_type, volume_kms_key_id, volume_size_in_gb)
Prepares a dict that represents the ProcessingResources.
- Parameters
instance_count (int) – The number of ML compute instances to use in the processing job. For distributed processing jobs, specify a value greater than 1. The default value is 1.
instance_type (str) – The ML compute instance type for the processing job.
volume_kms_key_id (str) – The AWS Key Management Service (AWS KMS) key that Amazon SageMaker uses to encrypt data on the storage volume attached to the ML compute instance(s) that run the processing job.
volume_size_in_gb (int) – The size of the ML storage volume in gigabytes that you want to provision. You must specify sufficient ML storage for your scenario.
- Returns
Represents ProcessingResources which identifies the resources, ML compute instances, and ML storage volumes to deploy for a processing job.
- Return type
dict
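To make the three prepare_* helpers above more concrete, the sketch below builds dictionaries in the shape used by the CreateProcessingJob API (AppSpecification, ProcessingOutputConfig, ProcessingResources). These are stand-in functions written for illustration, under the assumption that optional values are simply omitted when they are None; they are not the SDK's implementation.

```python
def app_specification(container_arguments, container_entrypoint, image_uri):
    """Build an AppSpecification-shaped dict; optional fields are skipped if None."""
    spec = {"ImageUri": image_uri}
    if container_arguments is not None:
        spec["ContainerArguments"] = container_arguments
    if container_entrypoint is not None:
        spec["ContainerEntrypoint"] = container_entrypoint
    return spec


def output_config(kms_key_id, outputs):
    """Build a ProcessingOutputConfig-shaped dict; the key applies to all outputs."""
    config = {"Outputs": outputs}
    if kms_key_id is not None:
        config["KmsKeyId"] = kms_key_id
    return config


def processing_resources(instance_count, instance_type, volume_kms_key_id, volume_size_in_gb):
    """Build a ProcessingResources-shaped dict with a single ClusterConfig."""
    cluster = {
        "InstanceCount": instance_count,
        "InstanceType": instance_type,
        "VolumeSizeInGB": volume_size_in_gb,
    }
    if volume_kms_key_id is not None:
        cluster["VolumeKmsKeyId"] = volume_kms_key_id
    return {"ClusterConfig": cluster}
```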
class sagemaker.processing.ProcessingInput(source, destination, input_name=None, s3_data_type='S3Prefix', s3_input_mode='File', s3_data_distribution_type='FullyReplicated', s3_compression_type='None')
Bases: object
Accepts parameters that specify an Amazon S3 input for a processing job.
Also provides a method to turn those parameters into a dictionary.
Initializes a ProcessingInput instance. ProcessingInput accepts parameters that specify an Amazon S3 input for a processing job and provides a method to turn those parameters into a dictionary.
- Parameters
source (str) – The source for the input. If a local path is provided, it will automatically be uploaded to S3 under: “s3://<default-bucket-name>/<job-name>/input/<input-name>”.
destination (str) – The destination of the input.
input_name (str) – The name for the input. If a name is not provided, one will be generated (e.g. “input-1”).
s3_data_type (str) – Valid options are “ManifestFile” or “S3Prefix”.
s3_input_mode (str) – Valid options are “Pipe” or “File”.
s3_data_distribution_type (str) – Valid options are “FullyReplicated” or “ShardedByS3Key”.
s3_compression_type (str) – Valid options are “None” or “Gzip”.
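The valid options listed above can be checked before a ProcessingInput is created. The following is a small sketch under the assumption that catching a bad option locally is preferable to a failed API call; INPUT_OPTIONS and processing_input_kwargs are hypothetical names, and the defaults are taken from the class signature above.

```python
# Option name -> (default from the signature, valid values from the parameter list).
INPUT_OPTIONS = {
    "s3_data_type": ("S3Prefix", {"ManifestFile", "S3Prefix"}),
    "s3_input_mode": ("File", {"Pipe", "File"}),
    "s3_data_distribution_type": ("FullyReplicated", {"FullyReplicated", "ShardedByS3Key"}),
    "s3_compression_type": ("None", {"None", "Gzip"}),
}


def processing_input_kwargs(source, destination, input_name=None, **options):
    """Return validated keyword arguments for ProcessingInput."""
    unknown = set(options) - set(INPUT_OPTIONS)
    if unknown:
        raise TypeError(f"unknown options: {sorted(unknown)}")
    kwargs = {"source": source, "destination": destination, "input_name": input_name}
    for option, (default, allowed) in INPUT_OPTIONS.items():
        value = options.get(option, default)
        if value not in allowed:
            raise ValueError(f"{option} must be one of {sorted(allowed)}, got {value!r}")
        kwargs[option] = value
    return kwargs
```

The result can be passed on as ProcessingInput(**kwargs) once the SDK is available.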
class sagemaker.processing.ProcessingOutput(source, destination=None, output_name=None, s3_upload_mode='EndOfJob')
Bases: object
Accepts parameters that specify an Amazon S3 output for a processing job.
It also provides a method to turn those parameters into a dictionary.
Initializes a ProcessingOutput instance. ProcessingOutput accepts parameters that specify an Amazon S3 output for a processing job and provides a method to turn those parameters into a dictionary.
- Parameters
source (str) – The source for the output.
destination (str) – The destination of the output. If a destination is not provided, one will be generated: “s3://<default-bucket-name>/<job-name>/output/<output-name>”.
output_name (str) – The name of the output. If a name is not provided, one will be generated (e.g. “output-1”).
s3_upload_mode (str) – Valid options are “EndOfJob” or “Continuous”.
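The default naming described above can be pictured with a small stand-in. This sketch assumes that generated names follow the 1-based position of each output, which the page suggests with “output-1” but does not state; resolve_outputs and default_output_destination are hypothetical helpers operating on plain dicts, not on ProcessingOutput objects.

```python
def default_output_destination(default_bucket, job_name, output_name):
    """Format the documented default: s3://<default-bucket-name>/<job-name>/output/<output-name>."""
    return f"s3://{default_bucket}/{job_name}/output/{output_name}"


def resolve_outputs(outputs, default_bucket, job_name):
    """Fill in missing output names and destinations for a list of output dicts."""
    resolved = []
    for position, output in enumerate(outputs, start=1):
        # Assumption: generated names are numbered by position in the list.
        name = output.get("output_name") or f"output-{position}"
        destination = output.get("destination") or default_output_destination(
            default_bucket, job_name, name
        )
        resolved.append({**output, "output_name": name, "destination": destination})
    return resolved
```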
This module is the entry point for running Spark processing scripts.
This module contains code related to Spark Processors, which are used for Processing jobs. These jobs let customers perform data pre-processing, post-processing, feature engineering, data validation, and model evaluation on SageMaker using Spark and PySpark.
class sagemaker.spark.processing.PySparkProcessor(role, instance_type, instance_count, framework_version=None, py_version=None, container_version=None, image_uri=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)
Bases: sagemaker.spark.processing._SparkProcessorBase
Handles Amazon SageMaker processing tasks for jobs using PySpark.
Initializes a PySparkProcessor instance. The PySparkProcessor handles Amazon SageMaker processing tasks for jobs using SageMaker PySpark.
- Parameters
framework_version (str) – The version of SageMaker PySpark.
py_version (str) – The version of Python.
container_version (str) – The version of the Spark container.
role (str) – An AWS IAM role name or ARN. The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.
instance_type (str) – Type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
instance_count (int) – The number of instances to run the Processing job with. Defaults to 1.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – A KMS key for the processing volume.
output_kms_key (str) – The KMS key id for all ProcessingOutputs.
max_runtime_in_seconds (int) – Timeout in seconds. After this amount of time Amazon SageMaker terminates the job regardless of its current status.
base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict) – Environment variables to be passed to the processing job.
tags ([dict]) – List of tags to be passed to the processing job.
network_config (sagemaker.network.NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
run(submit_app, submit_py_files=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, configuration=None, spark_event_logs_s3_uri=None, kms_key=None)
Runs a processing job.
- Parameters
submit_app (str) – Path (local or S3) to the Python file to submit to Spark as the primary application.
submit_py_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --py-files option.
submit_jars (list[str]) – List of paths (local or S3) to provide for the spark-submit --jars option.
submit_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --files option.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
spark_event_logs_s3_uri (str) – S3 path to which Spark application events will be published.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
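As an illustration of the run() parameters above, the sketch below assembles arguments for a PySpark job, including a configuration entry in the EMR classification style (a list of dicts with Classification and Properties keys). The helper names, the framework version, the Spark property value, and the argument flags are placeholder assumptions; run_pyspark() requires the SDK and AWS credentials.

```python
def pyspark_run_kwargs(app_path, output_s3_uri, py_files=None, event_logs_s3_uri=None):
    """Assemble keyword arguments for PySparkProcessor.run (placeholder values)."""
    return {
        "submit_app": app_path,
        "submit_py_files": list(py_files or []),
        # Hypothetical flag understood by the submitted application.
        "arguments": ["--output", output_s3_uri],
        # EMR-style classification; the property and value are examples only.
        "configuration": [
            {
                "Classification": "spark-defaults",
                "Properties": {"spark.executor.memory": "2g"},
            }
        ],
        "spark_event_logs_s3_uri": event_logs_s3_uri,
        "wait": False,
        "logs": False,
    }


def run_pyspark(role, app_path, output_s3_uri):
    """Create a PySparkProcessor and submit one application."""
    # Deferred import: only needed when a job is actually submitted.
    from sagemaker.spark.processing import PySparkProcessor

    processor = PySparkProcessor(
        role=role,
        instance_type="ml.c4.xlarge",
        instance_count=2,
        framework_version="2.4",  # assumed version; use one your account supports
    )
    processor.run(**pyspark_run_kwargs(app_path, output_s3_uri))
    return processor
```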
class sagemaker.spark.processing.SparkJarProcessor(role, instance_type, instance_count, framework_version=None, py_version=None, container_version=None, image_uri=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)
Bases: sagemaker.spark.processing._SparkProcessorBase
Handles Amazon SageMaker processing tasks for jobs using Spark with Java or Scala Jars.
Initializes a SparkJarProcessor instance. The SparkJarProcessor handles Amazon SageMaker processing tasks for jobs using SageMaker Spark.
- Parameters
framework_version (str) – The version of SageMaker PySpark.
py_version (str) – The version of Python.
container_version (str) – The version of the Spark container.
role (str) – An AWS IAM role name or ARN. The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.
instance_type (str) – Type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
instance_count (int) – The number of instances to run the Processing job with. Defaults to 1.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – A KMS key for the processing volume.
output_kms_key (str) – The KMS key id for all ProcessingOutputs.
max_runtime_in_seconds (int) – Timeout in seconds. After this amount of time Amazon SageMaker terminates the job regardless of its current status.
base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict) – Environment variables to be passed to the processing job.
tags ([dict]) – List of tags to be passed to the processing job.
network_config (sagemaker.network.NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
run(submit_app, submit_class=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, configuration=None, spark_event_logs_s3_uri=None, kms_key=None)
Runs a processing job.
- Parameters
submit_app (str) – Path (local or S3) to the Jar file to submit to Spark as the primary application.
submit_class (str) – Java class reference to submit to Spark as the primary application.
submit_jars (list[str]) – List of paths (local or S3) to provide for the spark-submit --jars option.
submit_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --files option.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
spark_event_logs_s3_uri (str) – S3 path to which Spark application events will be published.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
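For a Jar-based job, the main difference from the PySpark case is submit_class, which names the entry class inside the Jar. The sketch below is a hypothetical helper with placeholder values; it passes configuration in its single-dict form, which the parameter description above allows alongside the list form.

```python
def spark_jar_run_kwargs(jar_path, main_class, extra_jars=None, arguments=None):
    """Assemble keyword arguments for SparkJarProcessor.run (placeholder values)."""
    return {
        "submit_app": jar_path,
        # Java class reference inside the Jar, e.g. a class with a main method.
        "submit_class": main_class,
        "submit_jars": list(extra_jars or []),
        "arguments": list(arguments or []),
        # Single EMR-style classification given as a dict; values are examples only.
        "configuration": {
            "Classification": "spark-defaults",
            "Properties": {"spark.executor.cores": "2"},
        },
        "wait": True,
        "logs": True,
    }


def run_spark_jar(processor, jar_path, main_class):
    """Run a Jar on an already constructed SparkJarProcessor instance."""
    processor.run(**spark_jar_run_kwargs(jar_path, main_class))
    return processor
```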