Processing
This module contains code related to the Processor class, which is used for Amazon SageMaker Processing Jobs. These jobs let users perform data pre-processing, post-processing, feature engineering, data validation, and model evaluation and interpretation on Amazon SageMaker.
class sagemaker.processing.Processor(role, image_uri, instance_count, instance_type, entrypoint=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)
Bases: object
Handles Amazon SageMaker Processing tasks.
Initializes a Processor instance.
Parameters:
- role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
- image_uri (str) – The URI of the Docker image to use for the processing jobs.
- instance_count (int) – The number of instances to run a processing job with.
- instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
- entrypoint (list[str]) – The entrypoint for the processing job (default: None). This is in the form of a list of strings that make a command.
- volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
- volume_kms_key (str) – A KMS key for the processing volume (default: None).
- output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
- max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status.
- base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
- sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
- env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
- tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
- network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
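The base_job_name entry above describes a default job name built from a prefix and the current timestamp. The following is a minimal sketch of that idea, not the SDK's own implementation: the helper name `default_job_name`, the timestamp format, and the 63-character cap are assumptions made for illustration.

```python
from datetime import datetime, timezone

MAX_NAME_LENGTH = 63  # assumed cap on job-name length, for illustration only


def default_job_name(base, now=None):
    """Return '<base>-<timestamp>', trimming the base so the result fits the cap."""
    now = now or datetime.now(timezone.utc)
    timestamp = now.strftime("%Y-%m-%d-%H-%M-%S")  # assumed format
    # Leave room for the joining hyphen and the timestamp.
    room = MAX_NAME_LENGTH - len(timestamp) - 1
    return f"{base[:room]}-{timestamp}"


# A fixed time keeps the example reproducible.
fixed = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(default_job_name("my-processing-image", fixed))
```

Passing an explicit job_name to run bypasses any generated name, which is useful when the name must be predictable.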
run(inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None)
Runs a processing job.
Parameters:
- inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
- outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
- arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
- wait (bool) – Whether the call should wait until the job completes (default: True).
- logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
- job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
- experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
Raises: ValueError – if logs is True but wait is False.
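The interaction between wait and logs can be sketched as a small argument check. This is an illustration of the documented rule only; the function name `check_wait_and_logs` and the error message are hypothetical.

```python
def check_wait_and_logs(wait=True, logs=True):
    """Reject the one unsupported combination: logs requested without waiting."""
    if logs and not wait:
        raise ValueError("logs can only be shown when wait is True")


check_wait_and_logs(wait=True, logs=True)    # accepted: block and show logs
check_wait_and_logs(wait=True, logs=False)   # accepted: block silently
check_wait_and_logs(wait=False, logs=False)  # accepted: return immediately

try:
    check_wait_and_logs(wait=False, logs=True)
except ValueError as err:
    print(f"rejected: {err}")
```

Because logs defaults to True, a non-blocking call needs both wait=False and logs=False.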
class sagemaker.processing.ScriptProcessor(role, image_uri, command, instance_count, instance_type, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)
Bases: sagemaker.processing.Processor
Handles Amazon SageMaker Processing tasks for jobs using a machine learning framework.
Initializes a ScriptProcessor instance.
Parameters:
- role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
- image_uri (str) – The URI of the Docker image to use for the processing jobs.
- command (list[str]) – The command to run, along with any command-line flags. Example: [“python3”, “-v”].
- instance_count (int) – The number of instances to run a processing job with.
- instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
- volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
- volume_kms_key (str) – A KMS key for the processing volume (default: None).
- output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
- max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status.
- base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
- sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
- env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
- tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
- network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
run(code, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None)
Runs a processing job.
Parameters:
- code (str) – This can be an S3 URI or a local path to a file with the framework script to run.
- inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
- outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
- arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
- wait (bool) – Whether the call should wait until the job completes (default: True).
- logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
- job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
- experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
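One way to picture how command and code fit together is that the script's path inside the container is appended to the command list to form the entrypoint. The sketch below is illustrative only: the helper name `build_entrypoint` and the container directory `/opt/ml/processing/input/code` are assumptions, not something this page states.

```python
import posixpath

# Assumed in-container directory for the script; illustrative only.
CODE_DIR = "/opt/ml/processing/input/code"


def build_entrypoint(command, code):
    """Append the in-container script path to a copy of the command list."""
    script_name = posixpath.basename(code)  # works for S3 URIs and POSIX paths
    return list(command) + [posixpath.join(CODE_DIR, script_name)]


entrypoint = build_entrypoint(["python3", "-v"], "s3://my-bucket/scripts/preprocess.py")
print(entrypoint)
```

Under this reading, entries in arguments would follow the script path on the command line.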
class sagemaker.processing.ProcessingJob(sagemaker_session, job_name, inputs, outputs, output_kms_key=None)
Bases: sagemaker.job._Job
Provides functionality to start, describe, and stop processing jobs.
Initializes a Processing job.
Parameters:
- sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
- job_name (str) – Name of the Processing job.
- inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
- outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
- output_kms_key (str) – The output KMS key associated with the job (default: None).
classmethod start_new(processor, inputs, outputs, experiment_config)
Starts a new processing job using the provided inputs and outputs.
Parameters:
- processor (Processor) – The Processor instance that started the job.
- inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
- outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
- experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
Returns: The instance of ProcessingJob created using the Processor.
Return type: ProcessingJob
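The experiment_config dictionary accepts three optional keys. A caller-side check along the following lines can catch misspelled keys before a job is started. The function `check_experiment_config` is hypothetical; this page does not say whether the SDK itself performs such a check.

```python
ALLOWED_KEYS = {"ExperimentName", "TrialName", "TrialComponentDisplayName"}


def check_experiment_config(experiment_config):
    """Return a copy of the config if it uses only the three documented keys."""
    if experiment_config is None:
        return None
    unknown = set(experiment_config) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected experiment_config keys: {sorted(unknown)}")
    return dict(experiment_config)


config = check_experiment_config({"ExperimentName": "churn", "TrialName": "baseline"})
print(config)
```

All three keys are optional, so an empty dictionary passes the check.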
classmethod from_processing_name(sagemaker_session, processing_job_name)
Initializes a ProcessingJob from a processing job name.
Returns: The instance of ProcessingJob created from the job name.
Return type: ProcessingJob
classmethod from_processing_arn(sagemaker_session, processing_job_arn)
Initializes a ProcessingJob from a processing job ARN.
Returns: The instance of ProcessingJob created from the processing job’s ARN.
Return type: ProcessingJob
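Looking up a job by ARN comes down to recovering the job name from the ARN's resource part. A minimal sketch follows, assuming the ARN has the form `arn:aws:sagemaker:<region>:<account-id>:processing-job/<job-name>`; the helper `job_name_from_arn` is hypothetical and is not part of the API documented here.

```python
def job_name_from_arn(processing_job_arn):
    """Extract the job name from a processing job ARN (assumed format above)."""
    parts = processing_job_arn.split(":", 5)
    prefix = "processing-job/"
    # The sixth field is the resource, expected to be 'processing-job/<name>'.
    if len(parts) < 6 or not parts[5].startswith(prefix):
        raise ValueError(f"not a processing job ARN: {processing_job_arn}")
    return parts[5][len(prefix):]


arn = "arn:aws:sagemaker:us-west-2:123456789012:processing-job/my-job-2020-01-02"
print(job_name_from_arn(arn))
```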
wait(logs=True)
Waits for the processing job to complete.
Parameters: logs (bool) – Whether to show the logs produced by the job (default: True).
describe()
Prints out a response from the DescribeProcessingJob API call.
stop()
Stops the processing job.
class sagemaker.processing.ProcessingInput(source, destination, input_name=None, s3_data_type='S3Prefix', s3_input_mode='File', s3_data_distribution_type='FullyReplicated', s3_compression_type='None')
Bases: object
Accepts parameters that specify an Amazon S3 input for a processing job and provides a method to turn those parameters into a dictionary.
Initializes a ProcessingInput instance.
Parameters:
- source (str) – The source for the input. If a local path is provided, it will automatically be uploaded to S3 under: “s3://<default-bucket-name>/<job-name>/input/<input-name>”.
- destination (str) – The destination of the input.
- input_name (str) – The name for the input. If a name is not provided, one will be generated (e.g. “input-1”).
- s3_data_type (str) – Valid options are “ManifestFile” or “S3Prefix”.
- s3_input_mode (str) – Valid options are “Pipe” or “File”.
- s3_data_distribution_type (str) – Valid options are “FullyReplicated” or “ShardedByS3Key”.
- s3_compression_type (str) – Valid options are “None” or “Gzip”.
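The "turn those parameters into a dictionary" step can be pictured as below. This is a sketch under assumptions: the function name is hypothetical, and the key names are assumed to mirror the input entries of the CreateProcessingJob request rather than taken from this page. The valid option sets come from the parameter list above.

```python
VALID_OPTIONS = {
    "s3_data_type": {"ManifestFile", "S3Prefix"},
    "s3_input_mode": {"Pipe", "File"},
    "s3_data_distribution_type": {"FullyReplicated", "ShardedByS3Key"},
    "s3_compression_type": {"None", "Gzip"},
}


def processing_input_to_dict(source, destination, input_name, **options):
    """Build a request-style dictionary for one S3 input (assumed key names)."""
    settings = {
        "s3_data_type": "S3Prefix",
        "s3_input_mode": "File",
        "s3_data_distribution_type": "FullyReplicated",
        "s3_compression_type": "None",
    }
    settings.update(options)
    for key, value in settings.items():
        if key not in VALID_OPTIONS or value not in VALID_OPTIONS[key]:
            raise ValueError(f"invalid option {key}={value!r}")
    return {
        "InputName": input_name,
        "S3Input": {
            "S3Uri": source,
            "LocalPath": destination,
            "S3DataType": settings["s3_data_type"],
            "S3InputMode": settings["s3_input_mode"],
            "S3DataDistributionType": settings["s3_data_distribution_type"],
            "S3CompressionType": settings["s3_compression_type"],
        },
    }


request_input = processing_input_to_dict(
    "s3://my-bucket/raw/", "/opt/ml/processing/input", "input-1", s3_input_mode="Pipe"
)
print(request_input)
```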
class sagemaker.processing.ProcessingOutput(source, destination=None, output_name=None, s3_upload_mode='EndOfJob')
Bases: object
Accepts parameters that specify an Amazon S3 output for a processing job and provides a method to turn those parameters into a dictionary.
Initializes a ProcessingOutput instance.
Parameters:
- source (str) – The source for the output.
- destination (str) – The destination of the output. If a destination is not provided, one will be generated: “s3://<default-bucket-name>/<job-name>/output/<output-name>”.
- output_name (str) – The name of the output. If a name is not provided, one will be generated (e.g. “output-1”).
- s3_upload_mode (str) – Valid options are “EndOfJob” or “Continuous”.
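The defaults described for destination and output_name can be sketched as follows. Outputs are modelled as plain dictionaries to keep the example self-contained. The helper `fill_output_defaults` is hypothetical, and numbering generated names by list position is an assumption.

```python
def fill_output_defaults(outputs, bucket, job_name):
    """Fill in a missing output_name and destination for each output dict."""
    filled = []
    for index, output in enumerate(outputs, start=1):
        name = output.get("output_name") or f"output-{index}"
        destination = (
            output.get("destination")
            or f"s3://{bucket}/{job_name}/output/{name}"
        )
        filled.append({**output, "output_name": name, "destination": destination})
    return filled


outputs = fill_output_defaults(
    [
        {"source": "/opt/ml/processing/train"},
        {"source": "/opt/ml/processing/test", "output_name": "test"},
    ],
    bucket="my-default-bucket",
    job_name="my-job",
)
for item in outputs:
    print(item["output_name"], item["destination"])
```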