Processing
This module contains code related to the Processor class, which is used
for Amazon SageMaker Processing Jobs. These jobs let users perform data pre-processing,
post-processing, feature engineering, data validation, and model evaluation
and interpretation on Amazon SageMaker.

class sagemaker.processing.Processor(role, image_uri, instance_count, instance_type, entrypoint=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)
Bases: object
Handles Amazon SageMaker Processing tasks.
Initializes a Processor instance. The Processor handles Amazon SageMaker Processing tasks.
Parameters:
- role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
- image_uri (str) – The URI of the Docker image to use for the processing jobs.
- instance_count (int) – The number of instances to run a processing job with.
- instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
- entrypoint (list[str]) – The entrypoint for the processing job (default: None). This is in the form of a list of strings that make a command.
- volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
- volume_kms_key (str) – A KMS key for the processing volume (default: None).
- output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
- max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status.
- base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
- sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
- env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
- tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
- network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.

run(inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None)
Runs a processing job.
Parameters:
- inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
- outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
- arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
- wait (bool) – Whether the call should wait until the job completes (default: True).
- logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
- job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
- experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
Raises: ValueError – if logs is True but wait is False.
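
Two rules in the run() description above are easy to misread: a job name is only generated when job_name is omitted, and logs=True is rejected when wait is False. The sketch below mirrors both rules in plain Python. The helper names (resolve_job_name, check_run_flags) and the timestamp format are assumptions made for illustration; they are not taken from the SDK, whose actual naming scheme may differ.

```python
import time


def resolve_job_name(base_job_name, job_name=None, now=None):
    """Return job_name if given, else base_job_name plus a timestamp suffix.

    The suffix format is an assumption for illustration only.
    """
    if job_name is not None:
        return job_name
    stamp = time.strftime("%Y-%m-%d-%H-%M-%S", now or time.gmtime())
    return f"{base_job_name}-{stamp}"


def check_run_flags(wait=True, logs=True):
    """Mirror the documented rule: showing logs requires waiting for the job."""
    if logs and not wait:
        raise ValueError("logs=True is only valid when wait=True")
```

To start a job without blocking, a caller would therefore pass both wait=False and logs=False.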

class sagemaker.processing.ScriptProcessor(role, image_uri, command, instance_count, instance_type, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)
Bases: sagemaker.processing.Processor
Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.
Initializes a ScriptProcessor instance. The ScriptProcessor handles Amazon SageMaker Processing tasks for jobs using a machine learning framework.
Parameters:
- role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
- image_uri (str) – The URI of the Docker image to use for the processing jobs.
- command (list[str]) – The command to run, along with any command-line flags. Example: [“python3”, “-v”].
- instance_count (int) – The number of instances to run a processing job with.
- instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
- volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
- volume_kms_key (str) – A KMS key for the processing volume (default: None).
- output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
- max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status.
- base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
- sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
- env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
- tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
- network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.

run(code, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None)
Runs a processing job.
Parameters:
- code (str) – This can be an S3 URI or a local path to a file with the framework script to run.
- inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
- outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
- arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
- wait (bool) – Whether the call should wait until the job completes (default: True).
- logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
- job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
- experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
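
As a rough illustration of how the ScriptProcessor constructor and run() fit together, the sketch below builds the constructor arguments from the signature above and submits one script. The helper names (processor_settings, run_script_job), the instance settings, the script arguments, and the container paths under /opt/ml/processing are assumptions chosen for the example. Actually submitting a job requires the sagemaker package, AWS credentials, and a real role and image URI.

```python
def processor_settings(role, image_uri):
    """Constructor arguments for ScriptProcessor, keyed by the documented names."""
    return {
        "role": role,
        "image_uri": image_uri,
        "command": ["python3"],
        "instance_count": 1,
        "instance_type": "ml.c4.xlarge",
        "max_runtime_in_seconds": 3600,  # illustrative timeout
    }


def run_script_job(role, image_uri, script, input_uri):
    """Submit a processing job that runs `script` against data at `input_uri`."""
    # Imported lazily so this module can be loaded without the SDK installed.
    from sagemaker.processing import (
        ProcessingInput,
        ProcessingOutput,
        ScriptProcessor,
    )

    processor = ScriptProcessor(**processor_settings(role, image_uri))
    processor.run(
        code=script,
        inputs=[
            # Destination is a path inside the container (assumed layout).
            ProcessingInput(source=input_uri, destination="/opt/ml/processing/input")
        ],
        outputs=[
            # No destination given, so a default S3 location is generated.
            ProcessingOutput(source="/opt/ml/processing/output")
        ],
        arguments=["--split", "0.2"],  # hypothetical script arguments
        wait=True,
        logs=True,
    )
    return processor
```

Keeping the settings in a separate function makes them easy to inspect or override before any AWS call is made.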

class sagemaker.processing.ProcessingJob(sagemaker_session, job_name, inputs, outputs, output_kms_key=None)
Bases: sagemaker.job._Job
Provides functionality to start, describe, and stop processing jobs.
Initializes a Processing job.
Parameters:
- sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
- job_name (str) – Name of the Processing job.
- inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
- outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
- output_kms_key (str) – The output KMS key associated with the job (default: None).

classmethod start_new(processor, inputs, outputs, experiment_config)
Starts a new processing job using the provided inputs and outputs.
Parameters:
- processor (Processor) – The Processor instance that started the job.
- inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
- outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
- experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’.
Returns: The instance of ProcessingJob created using the Processor.
Return type: ProcessingJob

classmethod from_processing_name(sagemaker_session, processing_job_name)
Initializes a ProcessingJob from a processing job name.
Returns: The instance of ProcessingJob created from the job name.
Return type: ProcessingJob

classmethod from_processing_arn(sagemaker_session, processing_job_arn)
Initializes a ProcessingJob from a Processing ARN.
Returns: The instance of ProcessingJob created from the processing job’s ARN.
Return type: ProcessingJob

wait(logs=True)
Waits for the processing job to complete.
Parameters: logs (bool) – Whether to show the logs produced by the job (default: True).

describe()
Prints out a response from the DescribeProcessingJob API call.

stop()
Stops the processing job.
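
The class methods and instance methods above combine naturally when a job was started elsewhere and only its name is known. The following sketch reattaches to such a job and blocks until it completes. The helper name reattach_and_wait is hypothetical, and the job class is passed in as an argument (expected to be sagemaker.processing.ProcessingJob) so that the function can be exercised without the SDK or AWS access.

```python
def reattach_and_wait(job_cls, sagemaker_session, processing_job_name, logs=False):
    """Look up an existing processing job by name and wait for it to finish.

    job_cls is expected to be sagemaker.processing.ProcessingJob; only the
    documented from_processing_name() and wait() methods are used.
    """
    job = job_cls.from_processing_name(
        sagemaker_session=sagemaker_session,
        processing_job_name=processing_job_name,
    )
    job.wait(logs=logs)
    return job
```

With the SDK installed, a call might look like reattach_and_wait(ProcessingJob, session, "my-job-name"), where session is an existing Session object and the job name is a placeholder.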

class sagemaker.processing.ProcessingInput(source, destination, input_name=None, s3_data_type='S3Prefix', s3_input_mode='File', s3_data_distribution_type='FullyReplicated', s3_compression_type='None')
Bases: object
Accepts parameters that specify an Amazon S3 input for a processing job and provides a method to turn those parameters into a dictionary.
Initializes a ProcessingInput instance. ProcessingInput accepts parameters that specify an Amazon S3 input for a processing job and provides a method to turn those parameters into a dictionary.
Parameters:
- source (str) – The source for the input. If a local path is provided, it will automatically be uploaded to S3 under: “s3://<default-bucket-name>/<job-name>/input/<input-name>”.
- destination (str) – The destination of the input.
- input_name (str) – The name for the input. If a name is not provided, one will be generated (e.g., “input-1”).
- s3_data_type (str) – Valid options are “ManifestFile” or “S3Prefix”.
- s3_input_mode (str) – Valid options are “Pipe” or “File”.
- s3_data_distribution_type (str) – Valid options are “FullyReplicated” or “ShardedByS3Key”.
- s3_compression_type (str) – Valid options are “None” or “Gzip”.
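
The option lists and the local-path upload rule above can be expressed as a small pre-flight check. This is a sketch only: VALID_OPTIONS, check_input_options, and resolve_input_source are hypothetical names, and treating any non-S3 source as a local path is an assumption rather than documented SDK behavior.

```python
from urllib.parse import urlparse

# Valid values as listed in the parameter descriptions above.
VALID_OPTIONS = {
    "s3_data_type": {"ManifestFile", "S3Prefix"},
    "s3_input_mode": {"Pipe", "File"},
    "s3_data_distribution_type": {"FullyReplicated", "ShardedByS3Key"},
    "s3_compression_type": {"None", "Gzip"},
}


def check_input_options(**options):
    """Raise ValueError if any option has a value outside its documented set."""
    for name, value in options.items():
        allowed = VALID_OPTIONS[name]
        if value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")


def resolve_input_source(source, default_bucket, job_name, input_name):
    """Return S3 URIs unchanged; map anything else to the documented upload location."""
    if urlparse(source).scheme == "s3":
        return source
    return f"s3://{default_bucket}/{job_name}/input/{input_name}"
```

Note that the compression option "None" is a string, not the Python None value.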

class sagemaker.processing.ProcessingOutput(source, destination=None, output_name=None, s3_upload_mode='EndOfJob')
Bases: object
Accepts parameters that specify an Amazon S3 output for a processing job and provides a method to turn those parameters into a dictionary.
Initializes a ProcessingOutput instance. ProcessingOutput accepts parameters that specify an Amazon S3 output for a processing job and provides a method to turn those parameters into a dictionary.
Parameters:
- source (str) – The source for the output.
- destination (str) – The destination of the output. If a destination is not provided, one will be generated: “s3://<default-bucket-name>/<job-name>/output/<output-name>”.
- output_name (str) – The name of the output. If a name is not provided, one will be generated (e.g., “output-1”).
- s3_upload_mode (str) – Valid options are “EndOfJob” or “Continuous”.
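
To make the defaulting rules for output_name and destination concrete, the sketch below fills them in for a list of plain dictionaries standing in for ProcessingOutput objects. The function name fill_output_defaults is hypothetical, and numbering generated names by list position is an assumption inferred from the “output-1” example, not confirmed SDK behavior.

```python
def fill_output_defaults(outputs, default_bucket, job_name):
    """Fill in missing names and destinations for a list of output specs.

    Each spec is a dict with a required 'source' and optional 'destination',
    'output_name', and 's3_upload_mode' keys.
    """
    resolved = []
    for index, output in enumerate(outputs, start=1):
        # Assumed numbering: position in the list, starting at 1.
        name = output.get("output_name") or f"output-{index}"
        destination = output.get("destination") or (
            f"s3://{default_bucket}/{job_name}/output/{name}"
        )
        resolved.append(
            {
                "source": output["source"],
                "destination": destination,
                "output_name": name,
                "s3_upload_mode": output.get("s3_upload_mode", "EndOfJob"),
            }
        )
    return resolved
```

An explicit destination or output_name is always kept as given; only missing values are generated.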