Processing
This module contains code related to the Processor class, which is used for Processing jobs. These jobs let users perform data pre-processing, post-processing, feature engineering, data validation, and model evaluation and interpretation on SageMaker.
class sagemaker.processing.Processor(role, image_uri, instance_count, instance_type, entrypoint=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)
Bases: object
Handles Amazon SageMaker processing tasks.
Initialize a Processor instance. The Processor handles Amazon SageMaker processing tasks.
Parameters:
- role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
- image_uri (str) – The URI of the image to use for the processing jobs started by the Processor.
- instance_count (int) – The number of instances to run the Processing job with.
- instance_type (str) – Type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
- entrypoint ([str]) – The entrypoint for the processing job.
- volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
- volume_kms_key (str) – A KMS key for the processing volume.
- output_kms_key (str) – The KMS key ID for all ProcessingOutputs.
- max_runtime_in_seconds (int) – Timeout in seconds. After this amount of time, Amazon SageMaker terminates the job regardless of its current status.
- base_job_name (str) – Prefix for the processing job name. If not specified, the processor generates a default job name based on the processing image name and the current timestamp.
- sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
- env (dict) – Environment variables to be passed to the processing job.
- tags ([dict]) – List of tags to be passed to the processing job.
- network_config (sagemaker.network.NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
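The following is a minimal sketch of how these constructor parameters fit together. The role ARN, image URI, entrypoint path, environment variable, and tag values are placeholders, and the helper names `make_processor_kwargs` and `build_processor` are hypothetical. The SDK import is deferred into `build_processor` so that the argument dictionary can be inspected without the SDK or AWS credentials.

```python
def make_processor_kwargs(role, image_uri):
    """Collect constructor arguments for sagemaker.processing.Processor.

    All concrete values below are illustrative placeholders.
    """
    return {
        "role": role,
        "image_uri": image_uri,
        "instance_count": 1,
        "instance_type": "ml.c4.xlarge",
        # Hypothetical entrypoint inside the container image.
        "entrypoint": ["python3", "/opt/ml/processing/code/run.py"],
        "volume_size_in_gb": 30,          # documented default
        "max_runtime_in_seconds": 3600,   # stop the job after one hour
        "base_job_name": "demo-processing",
        "env": {"LOG_LEVEL": "INFO"},
        "tags": [{"Key": "project", "Value": "demo"}],
    }


def build_processor(role, image_uri):
    """Create the Processor. Requires the sagemaker package and AWS credentials."""
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from sagemaker.processing import Processor

    return Processor(**make_processor_kwargs(role, image_uri))


kwargs = make_processor_kwargs(
    role="arn:aws:iam::111122223333:role/ExampleRole",
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/example:latest",
)
```

Parameters that are left out, such as `sagemaker_session` and `network_config`, fall back to the defaults described above.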
run(inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None)
Run a processing job.
Parameters:
- inputs ([sagemaker.processing.ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects.
- outputs ([sagemaker.processing.ProcessingOutput]) – Outputs for the processing job. These can be specified as either a path string or a ProcessingOutput object.
- arguments ([str]) – A list of string arguments to be passed to a processing job.
- wait (bool) – Whether the call should wait until the job completes (default: True).
- logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
- job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the image name and current timestamp.
- experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys, ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
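As a rough illustration of a `run` call, the sketch below builds an `experiment_config` dictionary restricted to the three documented keys and passes it along with inputs, outputs, and arguments. The helpers `make_experiment_config` and `run_job` are hypothetical, the key check is a convenience added here rather than documented SDK behavior, and the container paths under `/opt/ml/processing` are assumed placeholders.

```python
EXPERIMENT_KEYS = {"ExperimentName", "TrialName", "TrialComponentDisplayName"}


def make_experiment_config(**keys):
    """Build an experiment_config dict, rejecting keys outside the documented set."""
    unknown = set(keys) - EXPERIMENT_KEYS
    if unknown:
        raise ValueError("Unsupported experiment_config keys: %s" % sorted(unknown))
    return {name: str(value) for name, value in keys.items()}


def run_job(processor, input_s3_uri):
    """Start a processing job and block until it completes.

    Requires the sagemaker package and AWS credentials.
    """
    from sagemaker.processing import ProcessingInput, ProcessingOutput

    processor.run(
        inputs=[
            ProcessingInput(
                source=input_s3_uri,
                destination="/opt/ml/processing/input",  # assumed container path
            )
        ],
        outputs=[ProcessingOutput(source="/opt/ml/processing/output")],
        arguments=["--split", "0.2"],  # every argument must be a string
        wait=True,
        logs=True,
        experiment_config=make_experiment_config(ExperimentName="demo-experiment"),
    )
```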
class sagemaker.processing.ScriptProcessor(role, image_uri, command, instance_count, instance_type, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)
Bases: sagemaker.processing.Processor
Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.
Initialize a ScriptProcessor instance. The ScriptProcessor handles Amazon SageMaker processing tasks for jobs using script mode.
Parameters:
- role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
- image_uri (str) – The URI of the image to use for the processing jobs started by the Processor.
- command ([str]) – The command to run, along with any command-line flags. Example: ["python3", "-v"].
- instance_count (int) – The number of instances to run the Processing job with.
- instance_type (str) – Type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
- volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
- volume_kms_key (str) – A KMS key for the processing volume.
- output_kms_key (str) – The KMS key ID for all ProcessingOutputs.
- max_runtime_in_seconds (int) – Timeout in seconds. After this amount of time, Amazon SageMaker terminates the job regardless of its current status.
- base_job_name (str) – Prefix for the processing job name. If not specified, the processor generates a default job name based on the processing image name and the current timestamp.
- sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
- env (dict) – Environment variables to be passed to the processing job.
- tags ([dict]) – List of tags to be passed to the processing job.
- network_config (sagemaker.network.NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
run(code, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None)
Run a processing job with Script Mode.
Parameters:
- code (str) – This can be an S3 URI or a local path to either a directory or a file with the user’s script to run.
- inputs ([sagemaker.processing.ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects.
- outputs ([str or sagemaker.processing.ProcessingOutput]) – Outputs for the processing job. These can be specified as either a path string or a ProcessingOutput object.
- arguments ([str]) – A list of string arguments to be passed to a processing job.
- wait (bool) – Whether the call should wait until the job completes (default: True).
- logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
- job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the image name and current timestamp.
- experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys, ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
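A script-mode job might be launched along these lines. Because `arguments` must be a list of strings, the sketch includes a small hypothetical helper, `to_cli_arguments`, that flattens a dictionary of options into that form. The function `run_script`, the script path, and the output path are likewise assumptions made for illustration.

```python
def to_cli_arguments(options):
    """Flatten {"name": value} into ["--name", "value", ...] with string values."""
    arguments = []
    for name, value in options.items():
        arguments.extend(["--" + name, str(value)])
    return arguments


def run_script(role, image_uri, script_path):
    """Run a local or S3-hosted script as a processing job without blocking.

    Requires the sagemaker package and AWS credentials.
    """
    from sagemaker.processing import ProcessingOutput, ScriptProcessor

    processor = ScriptProcessor(
        role=role,
        image_uri=image_uri,
        command=["python3"],
        instance_count=1,
        instance_type="ml.c4.xlarge",
    )
    processor.run(
        code=script_path,  # S3 URI or local path to the script
        outputs=[ProcessingOutput(source="/opt/ml/processing/output")],  # assumed path
        arguments=to_cli_arguments({"train-ratio": 0.8, "seed": 42}),
        wait=False,  # return immediately; logs only apply when wait is True
    )
    return processor
```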
class sagemaker.processing.ProcessingJob(sagemaker_session, job_name, inputs, outputs, output_kms_key=None)
Bases: sagemaker.job._Job
Provides functionality to start, describe, and stop processing jobs.
Initializes a Processing job.
Parameters:
- sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, one is created using the default AWS configuration chain.
- job_name (str) – Name of the Processing job.
- inputs ([sagemaker.processing.ProcessingInput]) – A list of ProcessingInput objects.
- outputs ([sagemaker.processing.ProcessingOutput]) – A list of ProcessingOutput objects.
- output_kms_key (str) – The output kms key associated with the job. Defaults to None if not provided.
classmethod start_new(processor, inputs, outputs, experiment_config)
Start a new processing job using the provided inputs and outputs.
Parameters:
- processor (sagemaker.processing.Processor) – The Processor instance that started the job.
- inputs ([sagemaker.processing.ProcessingInput]) – A list of ProcessingInput objects.
- outputs ([sagemaker.processing.ProcessingOutput]) – A list of ProcessingOutput objects.
- experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys, ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
Returns: The instance of ProcessingJob created using the current job name.
Return type: sagemaker.processing.ProcessingJob
classmethod from_processing_name(sagemaker_session, processing_job_name)
Initializes a Processing job from a Processing job name.
Parameters:
- processing_job_name (str) – Name of the processing job.
- sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, one is created using the default AWS configuration chain.
Returns: The instance of ProcessingJob created using the current job name.
Return type: sagemaker.processing.ProcessingJob
classmethod from_processing_arn(sagemaker_session, processing_job_arn)
Initializes a Processing job from a Processing ARN.
Parameters:
- processing_job_arn (str) – ARN of the processing job.
- sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, one is created using the default AWS configuration chain.
Returns: The instance of ProcessingJob created using the current job name.
Return type: sagemaker.processing.ProcessingJob
wait(logs=True)
Wait for the Amazon SageMaker job to finish.
describe()
Prints out a response from the DescribeProcessingJob API call.
stop()
Stops the processing job.
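To show how the class methods and instance methods combine, here is a hedged sketch that attaches to an existing job by name, waits for it, and prints its description. The helper `job_name_from_arn` is hypothetical and assumes the ARN has the form `arn:aws:sagemaker:<region>:<account>:processing-job/<name>`. It is included only to illustrate the relationship between a job ARN and a job name, not to describe how `from_processing_arn` is implemented.

```python
def job_name_from_arn(processing_job_arn):
    """Extract the job name from a processing job ARN (assumed ARN layout)."""
    parts = processing_job_arn.split(":", 5)
    prefix = "processing-job/"
    if len(parts) < 6 or not parts[5].startswith(prefix):
        raise ValueError("Not a processing job ARN: %r" % processing_job_arn)
    return parts[5][len(prefix):]


def attach_and_wait(job_name, session=None):
    """Attach to a running job, wait for it to finish, and print its description.

    Requires the sagemaker package and AWS credentials.
    """
    from sagemaker.processing import ProcessingJob
    from sagemaker.session import Session

    session = session or Session()
    job = ProcessingJob.from_processing_name(
        sagemaker_session=session,
        processing_job_name=job_name,
    )
    job.wait(logs=False)  # block until the job finishes, without streaming logs
    job.describe()        # response of the DescribeProcessingJob API call
    return job
```

A job that should not run to completion can be ended early by calling `job.stop()` instead of `job.wait()`.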
class sagemaker.processing.ProcessingInput(source, destination, input_name=None, s3_data_type='S3Prefix', s3_input_mode='File', s3_data_distribution_type='FullyReplicated', s3_compression_type='None')
Bases: object
Accepts parameters that specify an S3 input for a processing job and provides a method to turn those parameters into a dictionary.
Initialize a ProcessingInput instance. ProcessingInput accepts parameters that specify an S3 input for a processing job and provides a method to turn those parameters into a dictionary.
Parameters:
- source (str) – The source for the input. If a local path is provided, it will automatically be uploaded to S3 under: “s3://<default-bucket-name>/<job-name>/input/<input-name>”.
- destination (str) – The destination of the input.
- input_name (str) – The user-provided name for the input. If a name is not provided, one will be generated (e.g. “input-1”).
- s3_data_type (str) – Valid options are “ManifestFile” or “S3Prefix”.
- s3_input_mode (str) – Valid options are “Pipe” or “File”.
- s3_data_distribution_type (str) – Valid options are “FullyReplicated” or “ShardedByS3Key”.
- s3_compression_type (str) – Valid options are “None” or “Gzip”.
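Since the four S3 options each accept only two values, it can help to check them before constructing the input. The sketch below encodes the valid options listed above in a lookup table. The helpers `check_input_options` and `make_input` are hypothetical, and the validation is performed by this example code, not something the text above attributes to the SDK.

```python
VALID_INPUT_OPTIONS = {
    "s3_data_type": {"ManifestFile", "S3Prefix"},
    "s3_input_mode": {"Pipe", "File"},
    "s3_data_distribution_type": {"FullyReplicated", "ShardedByS3Key"},
    "s3_compression_type": {"None", "Gzip"},
}


def check_input_options(**options):
    """Return the options unchanged if every value is one of the documented choices."""
    for name, value in options.items():
        if name not in VALID_INPUT_OPTIONS:
            raise KeyError("Unknown ProcessingInput option: %s" % name)
        if value not in VALID_INPUT_OPTIONS[name]:
            raise ValueError(
                "%s must be one of %s, got %r"
                % (name, sorted(VALID_INPUT_OPTIONS[name]), value)
            )
    return options


def make_input(source, destination, **options):
    """Build a ProcessingInput. Requires the sagemaker package."""
    from sagemaker.processing import ProcessingInput

    return ProcessingInput(
        source=source,
        destination=destination,
        **check_input_options(**options)
    )
```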
class sagemaker.processing.ProcessingOutput(source, destination=None, output_name=None, s3_upload_mode='EndOfJob')
Bases: object
Accepts parameters that specify an S3 output for a processing job and provides a method to turn those parameters into a dictionary.
Initialize a ProcessingOutput instance. ProcessingOutput accepts parameters that specify an S3 output for a processing job and provides a method to turn those parameters into a dictionary.
Parameters:
- source (str) – The source for the output.
- destination (str) – The destination of the output. If a destination is not provided, one will be generated: “s3://<default-bucket-name>/<job-name>/output/<output-name>”.
- output_name (str) – The name of the output. If a name is not provided, one will be generated (e.g. “output-1”).
- s3_upload_mode (str) – Valid options are “EndOfJob” or “Continuous”.
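The generated destination follows the pattern quoted above. As a small sketch, the hypothetical helper `default_output_destination` reproduces that pattern so the resulting URI can be predicted ahead of time, and `make_output` (also hypothetical) shows an output that relies on the generated values. The bucket and job names are placeholders, and the SDK produces the real destination itself.

```python
def default_output_destination(default_bucket, job_name, output_name):
    """Mirror the documented pattern for a generated output destination."""
    return "s3://{}/{}/output/{}".format(default_bucket, job_name, output_name)


def make_output(source, destination=None, output_name=None):
    """Build a ProcessingOutput that uploads when the job ends.

    Requires the sagemaker package. Leaving destination and output_name as
    None lets the SDK generate them as described above.
    """
    from sagemaker.processing import ProcessingOutput

    return ProcessingOutput(
        source=source,
        destination=destination,
        output_name=output_name,
        s3_upload_mode="EndOfJob",
    )


expected_uri = default_output_destination("my-bucket", "demo-job", "output-1")
```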