Processing

This module contains code related to the Processor class, which is used for Amazon SageMaker Processing Jobs. These jobs let users perform data pre-processing, post-processing, feature engineering, data validation, and model evaluation and interpretation on Amazon SageMaker.

class sagemaker.processing.Processor(role, image_uri, instance_count, instance_type, entrypoint=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: object

Handles Amazon SageMaker Processing tasks.

Initializes a Processor instance.

Parameters:
  • role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
  • image_uri (str) – The URI of the Docker image to use for the processing jobs.
  • instance_count (int) – The number of instances to run a processing job with.
  • instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
  • entrypoint (list[str]) – The entrypoint for the processing job (default: None). This is in the form of a list of strings that make a command.
  • volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
  • volume_kms_key (str) – A KMS key for the processing volume (default: None).
  • output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
  • max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status.
  • base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
  • env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
  • tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
  • network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
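As a rough illustration of how the constructor parameters above fit together, the sketch below gathers them into keyword arguments and defers the `sagemaker` import until a Processor is actually created. The helper names `build_processor_kwargs` and `make_processor`, the role ARN, the image URI, and the entrypoint path are placeholders invented for this example, not part of the SDK.

```python
def build_processor_kwargs(role, image_uri, instance_count=1,
                           instance_type="ml.c4.xlarge", **optional):
    """Collect Processor constructor arguments (hypothetical helper)."""
    kwargs = {
        "role": role,
        "image_uri": image_uri,
        "instance_count": instance_count,
        "instance_type": instance_type,
    }
    # Drop optional settings left as None so the documented defaults apply.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs


def make_processor(kwargs):
    """Create the Processor; needs the sagemaker package and AWS credentials."""
    # Imported lazily so build_processor_kwargs works without the SDK installed.
    from sagemaker.processing import Processor
    return Processor(**kwargs)


example_kwargs = build_processor_kwargs(
    role="arn:aws:iam::123456789012:role/ExampleProcessingRole",  # placeholder
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example:latest",  # placeholder
    entrypoint=["python3", "/opt/ml/processing/input/code/run.py"],  # placeholder path
    max_runtime_in_seconds=3600,
    volume_kms_key=None,  # omitted from the result because it is None
)
```

Calling `make_processor(example_kwargs)` would then construct the Processor, provided the placeholders are replaced with real values.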
run(inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None)

Runs a processing job.

Parameters:
  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
  • wait (bool) – Whether the call should wait until the job completes (default: True).
  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
  • experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
Raises:

ValueError – if logs is True but wait is False.
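The interaction between wait and logs can be sketched in a few lines. This is only an illustration of the documented rule; `check_wait_and_logs` is a hypothetical helper and the error message is invented, not the SDK's wording.

```python
def check_wait_and_logs(wait=True, logs=True):
    """Mirror the documented rule: showing logs requires waiting for the job."""
    if logs and not wait:
        raise ValueError(
            "logs=True requires wait=True; set wait=True or set logs=False."
        )
    return wait, logs


# A fire-and-forget call therefore needs both flags turned off.
fire_and_forget = check_wait_and_logs(wait=False, logs=False)
```

In practice this means an asynchronous call looks like `run(..., wait=False, logs=False)`, since logs defaults to True.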

class sagemaker.processing.ScriptProcessor(role, image_uri, command, instance_count, instance_type, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: sagemaker.processing.Processor

Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.

Initializes a ScriptProcessor instance.

Parameters:
  • role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
  • image_uri (str) – The URI of the Docker image to use for the processing jobs.
  • command (list[str]) – The command to run, along with any command-line flags. Example: [“python3”, “-v”].
  • instance_count (int) – The number of instances to run a processing job with.
  • instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
  • volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
  • volume_kms_key (str) – A KMS key for the processing volume (default: None).
  • output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
  • max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status.
  • base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
  • env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
  • tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
  • network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
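Because command is a list of strings rather than a single string, a small guard can catch the most common mistake before a job is created. The sketch below is one possible way to do that; `normalize_command` and `run_script` are hypothetical helpers, and the instance settings are arbitrary example values.

```python
def normalize_command(command):
    """Return command as a list of strings, rejecting a bare string."""
    if isinstance(command, str):
        raise TypeError('command must be a list such as ["python3", "-v"], not a string')
    command = list(command)
    if not command or not all(isinstance(part, str) for part in command):
        raise TypeError("command must be a non-empty list of strings")
    return command


def run_script(role, image_uri, code, command=("python3",)):
    """Create a ScriptProcessor and run one script (needs the SDK and AWS access)."""
    # Imported lazily so normalize_command works without the SDK installed.
    from sagemaker.processing import ScriptProcessor

    processor = ScriptProcessor(
        role=role,
        image_uri=image_uri,
        command=normalize_command(command),
        instance_count=1,
        instance_type="ml.c4.xlarge",
    )
    # code may be an S3 URI or a local path to the script.
    processor.run(code=code, wait=True, logs=True)
    return processor
```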
run(code, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None)

Runs a processing job.

Parameters:
  • code (str) – This can be an S3 URI or a local path to a file with the framework script to run.
  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
  • wait (bool) – Whether the call should wait until the job completes (default: True).
  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
  • experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
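When job_name is omitted, the name is derived from the base job name and the current timestamp. The sketch below shows one way such a name could be built; the timestamp format, the character clean-up, and the function `default_job_name` are assumptions made for illustration and may differ from what the SDK actually produces.

```python
import re
from datetime import datetime, timezone


def default_job_name(base_job_name, now=None):
    """Append a UTC timestamp to a base name (illustrative format, not the SDK's)."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d-%H-%M-%S")
    # Replace anything other than letters, digits, and hyphens with a hyphen.
    base = re.sub(r"[^A-Za-z0-9-]+", "-", base_job_name).strip("-")
    return f"{base}-{stamp}"


fixed_time = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
example_name = default_job_name("my_preprocessing", now=fixed_time)
# example_name == "my-preprocessing-2020-01-02-03-04-05"
```

Passing an explicit job_name to run() bypasses this kind of generation entirely, which is useful when downstream tooling needs a predictable name.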
class sagemaker.processing.ProcessingJob(sagemaker_session, job_name, inputs, outputs, output_kms_key=None)

Bases: sagemaker.job._Job

Provides functionality to start, describe, and stop processing jobs.

Initializes a Processing job.

Parameters:
  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
  • job_name (str) – Name of the Processing job.
  • inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
  • outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
  • output_kms_key (str) – The output KMS key associated with the job (default: None).
classmethod start_new(processor, inputs, outputs, experiment_config)

Starts a new processing job using the provided inputs and outputs.

Parameters:
  • processor (Processor) – The Processor instance that started the job.
  • inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
  • outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
  • experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
Returns:

The instance of ProcessingJob created using the Processor.

Return type:

ProcessingJob
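The experiment_config dictionary accepts three optional keys. A minimal check along the following lines could flag misspelled keys before a job is started; `check_experiment_config` is a hypothetical helper, and rejecting unknown keys is a choice made for this example rather than documented SDK behavior.

```python
ALLOWED_EXPERIMENT_KEYS = {
    "ExperimentName",
    "TrialName",
    "TrialComponentDisplayName",
}


def check_experiment_config(experiment_config):
    """Return a copy of the config, raising on keys outside the documented three."""
    if experiment_config is None:
        return None
    unknown = set(experiment_config) - ALLOWED_EXPERIMENT_KEYS
    if unknown:
        raise ValueError(f"Unknown experiment_config keys: {sorted(unknown)}")
    return dict(experiment_config)


example_config = check_experiment_config(
    {"ExperimentName": "churn-study", "TrialName": "baseline"}  # example values
)
```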

classmethod from_processing_name(sagemaker_session, processing_job_name)

Initializes a ProcessingJob from a processing job name.

Parameters:
  • processing_job_name (str) – Name of the processing job.
  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
Returns:

The instance of ProcessingJob created from the job name.

Return type:

ProcessingJob

classmethod from_processing_arn(sagemaker_session, processing_job_arn)

Initializes a ProcessingJob from a Processing ARN.

Parameters:
  • processing_job_arn (str) – ARN of the processing job.
  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
Returns:

The instance of ProcessingJob created from the processing job’s ARN.

Return type:

ProcessingJob
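The two classmethods above differ only in how the job is identified. As a sketch of how a job name relates to its ARN, the function below extracts the name, assuming the ARN follows the layout `arn:<partition>:sagemaker:<region>:<account>:processing-job/<name>`. Both that layout and the helper `job_name_from_arn` are assumptions for illustration; this is not necessarily how the SDK parses ARNs.

```python
def job_name_from_arn(processing_job_arn):
    """Extract the job name from a processing job ARN (assumed layout)."""
    prefix = "processing-job/"
    # The resource part is everything after the fifth colon.
    resource = processing_job_arn.split(":", 5)[-1]
    if not processing_job_arn.startswith("arn:") or not resource.startswith(prefix):
        raise ValueError(f"Not a processing job ARN: {processing_job_arn!r}")
    return resource[len(prefix):]


example_arn = "arn:aws:sagemaker:us-east-1:123456789012:processing-job/my-job"  # placeholder
example_job_name = job_name_from_arn(example_arn)
# example_job_name == "my-job"
```

With a name in hand, `ProcessingJob.from_processing_name(sagemaker_session, processing_job_name)` attaches to the existing job, while `from_processing_arn` accepts the ARN directly.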

wait(logs=True)

Waits for the processing job to complete.

Parameters:
  • logs (bool) – Whether to show the logs produced by the job (default: True).
describe()

Prints out a response from the DescribeProcessingJob API call.

stop()

Stops the processing job.

class sagemaker.processing.ProcessingInput(source, destination, input_name=None, s3_data_type='S3Prefix', s3_input_mode='File', s3_data_distribution_type='FullyReplicated', s3_compression_type='None')

Bases: object

Accepts parameters that specify an Amazon S3 input for a processing job and provides a method to turn those parameters into a dictionary.

Initializes a ProcessingInput instance.

Parameters:
  • source (str) – The source for the input. If a local path is provided, it will automatically be uploaded to S3 under: “s3://<default-bucket-name>/<job-name>/input/<input-name>”.
  • destination (str) – The destination of the input.
  • input_name (str) – The name for the input. If a name is not provided, one will be generated (e.g., “input-1”).
  • s3_data_type (str) – Valid options are “ManifestFile” or “S3Prefix”.
  • s3_input_mode (str) – Valid options are “Pipe” or “File”.
  • s3_data_distribution_type (str) – Valid options are “FullyReplicated” or “ShardedByS3Key”.
  • s3_compression_type (str) – Valid options are “None” or “Gzip”.
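To make the "parameters into a dictionary" idea concrete, the sketch below validates the option strings listed above and builds a nested dictionary. The key names (`InputName`, `S3Input`, `S3Uri`, `LocalPath`, and so on) follow the shape of the CreateProcessingJob request as I understand it and should be treated as an assumption. `input_to_dict` is a hypothetical stand-in, not the SDK's own method.

```python
VALID_INPUT_OPTIONS = {
    "s3_data_type": {"ManifestFile", "S3Prefix"},
    "s3_input_mode": {"Pipe", "File"},
    "s3_data_distribution_type": {"FullyReplicated", "ShardedByS3Key"},
    "s3_compression_type": {"None", "Gzip"},
}


def input_to_dict(source, destination, input_name="input-1",
                  s3_data_type="S3Prefix", s3_input_mode="File",
                  s3_data_distribution_type="FullyReplicated",
                  s3_compression_type="None"):
    """Validate the option strings and build a request-style dictionary."""
    chosen = {
        "s3_data_type": s3_data_type,
        "s3_input_mode": s3_input_mode,
        "s3_data_distribution_type": s3_data_distribution_type,
        "s3_compression_type": s3_compression_type,
    }
    for name, value in chosen.items():
        if value not in VALID_INPUT_OPTIONS[name]:
            raise ValueError(f"{name}={value!r} is not one of {sorted(VALID_INPUT_OPTIONS[name])}")
    return {
        "InputName": input_name,
        "S3Input": {
            "S3Uri": source,
            "LocalPath": destination,
            "S3DataType": s3_data_type,
            "S3InputMode": s3_input_mode,
            "S3DataDistributionType": s3_data_distribution_type,
            "S3CompressionType": s3_compression_type,
        },
    }


example_input = input_to_dict(
    source="s3://example-bucket/raw/",       # placeholder bucket
    destination="/opt/ml/processing/input",  # placeholder container path
)
```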
class sagemaker.processing.ProcessingOutput(source, destination=None, output_name=None, s3_upload_mode='EndOfJob')

Bases: object

Accepts parameters that specify an Amazon S3 output for a processing job and provides a method to turn those parameters into a dictionary.

Initializes a ProcessingOutput instance.

Parameters:
  • source (str) – The source for the output.
  • destination (str) – The destination of the output. If a destination is not provided, one will be generated: “s3://<default-bucket-name>/<job-name>/output/<output-name>”.
  • output_name (str) – The name of the output. If a name is not provided, one will be generated (e.g., “output-1”).
  • s3_upload_mode (str) – Valid options are “EndOfJob” or “Continuous”.
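The defaulting rules for destination and output_name can be sketched as follows. The URI pattern comes from the parameter description above. Numbering generated names by list position and the helper names `default_output_destination` and `resolve_outputs` are assumptions for illustration only.

```python
def default_output_destination(default_bucket, job_name, output_name):
    """Build the documented default URI for an output without a destination."""
    return f"s3://{default_bucket}/{job_name}/output/{output_name}"


def resolve_outputs(outputs, default_bucket, job_name):
    """Fill in missing names and destinations for a list of output specs.

    Each spec is a plain dict with "source" and optional "destination" and
    "output_name" keys, standing in for ProcessingOutput objects.
    """
    resolved = []
    for index, spec in enumerate(outputs, start=1):
        # Assumed naming scheme: output-1, output-2, ... by position.
        name = spec.get("output_name") or f"output-{index}"
        destination = spec.get("destination") or default_output_destination(
            default_bucket, job_name, name
        )
        resolved.append(
            {"source": spec["source"], "destination": destination, "output_name": name}
        )
    return resolved


example_outputs = resolve_outputs(
    [
        {"source": "/opt/ml/processing/train"},  # placeholder container path
        {"source": "/opt/ml/processing/test", "destination": "s3://other-bucket/test/"},
    ],
    default_bucket="example-bucket",  # placeholder
    job_name="my-job",                # placeholder
)
```

Supplying an explicit destination is usually preferable when later steps need to read the results from a known S3 location.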