Processing
This module contains code related to the Processor class, which is used for Amazon SageMaker Processing Jobs. These jobs let users perform data pre-processing, post-processing, feature engineering, data validation, model evaluation, and interpretation on Amazon SageMaker.
-
class sagemaker.processing.Processor(role, image_uri, instance_count, instance_type, entrypoint=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)
Bases: object
Handles Amazon SageMaker Processing tasks.
Initializes a Processor instance. The Processor handles Amazon SageMaker Processing tasks.
- Parameters
role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
image_uri (str) – The URI of the Docker image to use for the processing jobs.
instance_count (int) – The number of instances to run a processing job with.
instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
entrypoint (list[str]) – The entrypoint for the processing job (default: None). This is in the form of a list of strings that make a command.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – A KMS key for the processing volume (default: None).
output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.
base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
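As a rough illustration of the parameters above, the sketch below collects constructor arguments in a plain dictionary and resolves the timeout that would apply. It uses only the standard library. The role ARN, image URI, environment variable, tag, and the helper `effective_max_runtime` are hypothetical and are not part of the SDK; the 24-hour fallback mirrors the default documented for `max_runtime_in_seconds`.

```python
# Hypothetical values for illustration; substitute a real role and image.
processor_kwargs = {
    "role": "arn:aws:iam::111122223333:role/ExampleProcessingRole",
    "image_uri": "111122223333.dkr.ecr.us-west-2.amazonaws.com/example-image:latest",
    "instance_count": 1,
    "instance_type": "ml.c4.xlarge",
    "volume_size_in_gb": 30,
    "max_runtime_in_seconds": None,
    "env": {"LOG_LEVEL": "INFO"},
    "tags": [{"Key": "project", "Value": "demo"}],
}

# Documented fallback when no timeout is given: 24 hours, in seconds.
DEFAULT_MAX_RUNTIME_SECONDS = 24 * 60 * 60


def effective_max_runtime(max_runtime_in_seconds=None):
    """Return the timeout that applies, using the 24-hour default for None."""
    if max_runtime_in_seconds is None:
        return DEFAULT_MAX_RUNTIME_SECONDS
    return max_runtime_in_seconds


print(effective_max_runtime(processor_kwargs["max_runtime_in_seconds"]))
```

With the SDK installed and AWS credentials configured, a dictionary like this could be unpacked into the constructor as `Processor(**processor_kwargs)`.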
-
run(inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None)
Runs a processing job.
- Parameters
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
- Raises
ValueError – if logs is True but wait is False.
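The interaction between `wait`, `logs`, and `job_name` can be sketched with two small helpers. Both are hypothetical stand-ins written for this illustration, not the SDK's code: `check_run_flags` reproduces the documented ValueError condition, and `default_job_name` assumes a simple "base name plus UTC timestamp" scheme, since the exact format the processor uses is not specified here.

```python
from datetime import datetime, timezone


def check_run_flags(wait=True, logs=True):
    """Raise ValueError for the documented invalid combination."""
    if logs and not wait:
        raise ValueError("logs can only be shown when wait is True")


def default_job_name(base_job_name, now=None):
    """Assumed naming scheme: base job name followed by a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return f"{base_job_name}-{now.strftime('%Y-%m-%d-%H-%M-%S')}"


check_run_flags(wait=True, logs=True)    # accepted: the defaults
check_run_flags(wait=False, logs=False)  # accepted: fire and forget
print(default_job_name("demo-processing"))
```

Because `logs` defaults to True, a caller who passes `wait=False` also needs to pass `logs=False` to avoid the error.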
-
class sagemaker.processing.ScriptProcessor(role, image_uri, command, instance_count, instance_type, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)
Bases: sagemaker.processing.Processor
Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.
Initializes a ScriptProcessor instance. The ScriptProcessor handles Amazon SageMaker Processing tasks for jobs using a machine learning framework.
- Parameters
role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
image_uri (str) – The URI of the Docker image to use for the processing jobs.
command (list[str]) – The command to run, along with any command-line flags. Example: ["python3", "-v"].
instance_count (int) – The number of instances to run a processing job with.
instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – A KMS key for the processing volume (default: None).
output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.
base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
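The `command` parameter expects a list of strings rather than a single shell string. A small sketch of producing that form with the standard library's `shlex.split` follows; the helper name `to_command_list` is hypothetical and not part of the SDK.

```python
import shlex


def to_command_list(command_line):
    """Split a shell-style command line into a list of strings."""
    return shlex.split(command_line)


print(to_command_list("python3 -v"))  # ['python3', '-v']
```

Splitting with `shlex` rather than `str.split` keeps quoted arguments that contain spaces together as one list element.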
-
run(code, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None)
Runs a processing job.
- Parameters
code (str) – This can be an S3 URI or a local path to a file with the framework script to run.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
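One way to picture how `command` and `code` relate is that the script's file name is appended to the command list to form what the container runs. The sketch below illustrates that idea only. How the SDK actually combines them is not described in this reference, and both the container directory `CODE_DIR` and the helper `script_entrypoint` are assumptions made for the example.

```python
import posixpath
from urllib.parse import urlparse

# Assumed in-container directory for the script; not stated in this reference.
CODE_DIR = "/opt/ml/processing/input/code"


def script_entrypoint(command, code):
    """Append the assumed in-container script path to the command list."""
    # Accept either an S3 URI or a POSIX-style local path.
    path = urlparse(code).path if code.startswith("s3://") else code
    filename = posixpath.basename(path)
    return list(command) + [posixpath.join(CODE_DIR, filename)]


print(script_entrypoint(["python3", "-v"], "s3://example-bucket/code/preprocess.py"))
```

The values in `arguments` would then be passed to the script separately, after this entrypoint.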
-
class sagemaker.processing.ProcessingJob(sagemaker_session, job_name, inputs, outputs, output_kms_key=None)
Bases: sagemaker.job._Job
Provides functionality to start, describe, and stop processing jobs.
Initializes a Processing job.
- Parameters
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
job_name (str) – Name of the Processing job.
inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
output_kms_key (str) – The output KMS key associated with the job (default: None).
-
classmethod start_new(processor, inputs, outputs, experiment_config)
Starts a new processing job using the provided inputs and outputs.
- Parameters
processor (Processor) – The Processor instance that started the job.
inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
- Returns
The instance of ProcessingJob created using the Processor.
- Return type
sagemaker.processing.ProcessingJob
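Since `experiment_config` accepts only three optional keys, a caller may want to check a dictionary before submitting a job. The helper below is a hypothetical pre-flight check written for this illustration; whether the SDK performs a similar check itself is not stated in this reference.

```python
ALLOWED_EXPERIMENT_KEYS = {
    "ExperimentName",
    "TrialName",
    "TrialComponentDisplayName",
}


def check_experiment_config(experiment_config):
    """Return the config unchanged, or raise if it has unexpected keys."""
    if experiment_config is None:
        return None
    unknown = set(experiment_config) - ALLOWED_EXPERIMENT_KEYS
    if unknown:
        raise ValueError(f"Unexpected experiment_config keys: {sorted(unknown)}")
    return experiment_config


# All keys are optional, so a partial dictionary is accepted.
print(check_experiment_config({"ExperimentName": "demo-experiment"}))
```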
-
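classmethod from_processing_name(sagemaker_session, processing_job_name)
Initializes a ProcessingJob from a processing job name.
- Parameters
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed.
processing_job_name (str) – Name of the processing job.
- Returns
The instance of ProcessingJob created from the job name.
- Return type
sagemaker.processing.ProcessingJob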
-
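classmethod from_processing_arn(sagemaker_session, processing_job_arn)
Initializes a ProcessingJob from a Processing ARN.
- Parameters
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed.
processing_job_arn (str) – ARN of the processing job.
- Returns
The instance of ProcessingJob created from the processing job’s ARN.
- Return type
sagemaker.processing.ProcessingJob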
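A processing job ARN carries the job name in its final segment, which is what makes lookup by ARN equivalent to lookup by name. The sketch below extracts that name, assuming ARNs of the form `arn:aws:sagemaker:<region>:<account>:processing-job/<name>`. The helper `job_name_from_arn` and the sample ARN are hypothetical, and this is not necessarily how `from_processing_arn` resolves the job internally.

```python
def job_name_from_arn(processing_job_arn):
    """Extract the job name from an ARN ending in 'processing-job/<name>'."""
    # The resource part is everything after the fifth colon.
    resource = processing_job_arn.split(":", 5)[-1]
    prefix, _, name = resource.partition("/")
    if prefix != "processing-job" or not name:
        raise ValueError(f"Not a processing job ARN: {processing_job_arn}")
    return name


arn = "arn:aws:sagemaker:us-west-2:111122223333:processing-job/demo-job"
print(job_name_from_arn(arn))  # demo-job
```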
-
wait(logs=True)
Waits for the processing job to complete.
- Parameters
logs (bool) – Whether to show the logs produced by the job (default: True).
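Waiting for a job generally amounts to polling its description until the status is terminal. The following is a minimal sketch of that pattern, not the SDK's implementation. It assumes the description is a dictionary with a `ProcessingJobStatus` field whose terminal values are `Completed`, `Failed`, and `Stopped`; the `describe` callable, the canned responses, and `wait_for_job` itself are stand-ins so the example runs without AWS access.

```python
import time

# Assumed terminal values of the ProcessingJobStatus field.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(describe, poll_seconds=30, sleep=time.sleep):
    """Poll describe() until the job reports a terminal status."""
    while True:
        status = describe()["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Stand-in for successive job descriptions returned by the service.
responses = iter([
    {"ProcessingJobStatus": "InProgress"},
    {"ProcessingJobStatus": "InProgress"},
    {"ProcessingJobStatus": "Completed"},
])
final_status = wait_for_job(lambda: next(responses), poll_seconds=0)
print(final_status)  # Completed
```

Passing the `sleep` function as a parameter keeps the loop easy to test, since a test can substitute a function that records calls instead of pausing.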
-
describe()
Prints out a response from the DescribeProcessingJob API call.
-
stop()
Stops the processing job.
-
class sagemaker.processing.ProcessingInput(source, destination, input_name=None, s3_data_type='S3Prefix', s3_input_mode='File', s3_data_distribution_type='FullyReplicated', s3_compression_type='None')
Bases: object
Accepts parameters that specify an Amazon S3 input for a processing job and provides a method to turn those parameters into a dictionary.
Initializes a ProcessingInput instance. ProcessingInput accepts parameters that specify an Amazon S3 input for a processing job and provides a method to turn those parameters into a dictionary.
- Parameters
source (str) – The source for the input. If a local path is provided, it will automatically be uploaded to S3 under: “s3://<default-bucket-name>/<job-name>/input/<input-name>”.
destination (str) – The destination of the input.
input_name (str) – The name for the input. If a name is not provided, one will be generated (e.g. “input-1”).
s3_data_type (str) – Valid options are “ManifestFile” or “S3Prefix”.
s3_input_mode (str) – Valid options are “Pipe” or “File”.
s3_data_distribution_type (str) – Valid options are “FullyReplicated” or “ShardedByS3Key”.
s3_compression_type (str) – Valid options are “None” or “Gzip”.
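The conversion of these parameters into a dictionary can be sketched as follows. This is an illustration under assumptions rather than the class's actual method: the key names follow the shape of an S3 input in the CreateProcessingJob request as understood here, and the helpers `default_input_uri` and `input_to_request_dict`, along with the bucket and job names, are hypothetical.

```python
VALID_INPUT_OPTIONS = {
    "s3_data_type": {"ManifestFile", "S3Prefix"},
    "s3_input_mode": {"Pipe", "File"},
    "s3_data_distribution_type": {"FullyReplicated", "ShardedByS3Key"},
    "s3_compression_type": {"None", "Gzip"},
}


def default_input_uri(bucket, job_name, input_name):
    """S3 location documented for inputs uploaded from a local path."""
    return f"s3://{bucket}/{job_name}/input/{input_name}"


def input_to_request_dict(source, destination, input_name, **options):
    """Build a request-style dictionary, rejecting undocumented option values."""
    settings = {
        "s3_data_type": "S3Prefix",
        "s3_input_mode": "File",
        "s3_data_distribution_type": "FullyReplicated",
        "s3_compression_type": "None",
    }
    for key, value in options.items():
        if key not in settings or value not in VALID_INPUT_OPTIONS[key]:
            raise ValueError(f"Invalid option {key}={value!r}")
        settings[key] = value
    return {
        "InputName": input_name,
        "S3Input": {
            "S3Uri": source,
            "LocalPath": destination,
            "S3DataType": settings["s3_data_type"],
            "S3InputMode": settings["s3_input_mode"],
            "S3DataDistributionType": settings["s3_data_distribution_type"],
            "S3CompressionType": settings["s3_compression_type"],
        },
    }


uri = default_input_uri("example-bucket", "demo-job", "input-1")
print(input_to_request_dict(uri, "/opt/ml/processing/input", "input-1"))
```

The destination path used here is only an example of a location inside the processing container.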
-
class sagemaker.processing.ProcessingOutput(source, destination=None, output_name=None, s3_upload_mode='EndOfJob')
Bases: object
Accepts parameters that specify an Amazon S3 output for a processing job and provides a method to turn those parameters into a dictionary.
Initializes a ProcessingOutput instance. ProcessingOutput accepts parameters that specify an Amazon S3 output for a processing job and provides a method to turn those parameters into a dictionary.
- Parameters
source (str) – The source for the output.
destination (str) – The destination of the output. If a destination is not provided, one will be generated: “s3://<default-bucket-name>/<job-name>/output/<output-name>”.
output_name (str) – The name of the output. If a name is not provided, one will be generated (e.g. “output-1”).
s3_upload_mode (str) – Valid options are “EndOfJob” or “Continuous”.
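For outputs, the direction is reversed: `source` is a path inside the container and `destination` is an S3 location. A hedged sketch of the default destination and the dictionary form follows. The key names reflect the shape of an S3 output in the CreateProcessingJob request as understood here, and the helpers `default_output_uri` and `output_to_request_dict`, plus the bucket and job names, are hypothetical rather than the class's actual method.

```python
VALID_UPLOAD_MODES = {"EndOfJob", "Continuous"}


def default_output_uri(bucket, job_name, output_name):
    """S3 location documented for outputs without an explicit destination."""
    return f"s3://{bucket}/{job_name}/output/{output_name}"


def output_to_request_dict(source, destination, output_name,
                           s3_upload_mode="EndOfJob"):
    """Build a request-style dictionary for one processing output."""
    if s3_upload_mode not in VALID_UPLOAD_MODES:
        raise ValueError(f"Invalid s3_upload_mode: {s3_upload_mode!r}")
    return {
        "OutputName": output_name,
        "S3Output": {
            "S3Uri": destination,
            "LocalPath": source,
            "S3UploadMode": s3_upload_mode,
        },
    }


destination = default_output_uri("example-bucket", "demo-job", "output-1")
print(output_to_request_dict("/opt/ml/processing/output", destination, "output-1"))
```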