Airflow

training_config

sagemaker.workflow.airflow.training_config(estimator, inputs=None, job_name=None, mini_batch_size=None)

Export Airflow training config from an estimator

Parameters
  • estimator (sagemaker.estimator.EstimatorBase) – The estimator to export training config from. Can be a BYO estimator, Framework estimator or Amazon algorithm estimator.

  • inputs

    Information about the training data. Please refer to the fit() method of the associated estimator, as this can take any of the following forms:

    • (str) - The S3 location where training data is saved.

    • (dict[str, str] or dict[str, sagemaker.inputs.TrainingInput]) - If using multiple channels for training data, you can specify a dict mapping channel names to strings or TrainingInput() objects.

    • (sagemaker.inputs.TrainingInput) - Channel configuration for S3 data sources that can provide additional information about the training dataset. See sagemaker.inputs.TrainingInput() for full details.

    • (sagemaker.amazon.amazon_estimator.RecordSet) - A collection of Amazon Record objects serialized and stored in S3. For use with an estimator for an Amazon algorithm.

    • (list[sagemaker.amazon.amazon_estimator.RecordSet]) - A list of RecordSet objects, where each instance is a different channel of training data.

  • job_name (str) – Specify a training job name if needed.

  • mini_batch_size (int) – Specify this argument only when estimator is a built-in estimator of an Amazon algorithm. For other estimators, batch size should be specified in the estimator.

Returns

Training config that can be directly used by SageMakerTrainingOperator in Airflow.

Return type

dict
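
Example (a minimal sketch, not part of the SDK reference): building a training config from a generic Estimator and handing it to an Airflow training operator. The image URI, role ARN, and S3 paths are placeholders, and the operator import path depends on your Airflow version.

    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput
    from sagemaker.workflow.airflow import training_config

    # Placeholder values -- substitute your own image, role, and S3 locations.
    estimator = Estimator(
        image_uri="<training-image-uri>",
        role="<execution-role-arn>",
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://<bucket>/output",
    )
    estimator.set_hyperparameters(epochs=10)

    inputs = {"train": TrainingInput("s3://<bucket>/train", content_type="text/csv")}

    # Dict consumed by SageMakerTrainingOperator via its `config` argument.
    train_config = training_config(estimator=estimator, inputs=inputs)

    # In the DAG definition (import path may differ by Airflow version):
    # from airflow.providers.amazon.aws.operators.sagemaker import SageMakerTrainingOperator
    # train_op = SageMakerTrainingOperator(task_id="train", config=train_config)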

tuning_config

sagemaker.workflow.airflow.tuning_config(tuner, inputs, job_name=None, include_cls_metadata=False, mini_batch_size=None)

Export Airflow tuning config from a HyperparameterTuner

Parameters
  • tuner (sagemaker.tuner.HyperparameterTuner) – The tuner to export tuning config from.

  • inputs

    Information about the training data. Please refer to the fit() method of the associated estimator in the tuner, as this can take any of the following forms:

    • (str) - The S3 location where training data is saved.

    • (dict[str, str] or dict[str, sagemaker.inputs.TrainingInput]) - If using multiple channels for training data, you can specify a dict mapping channel names to strings or TrainingInput() objects.

    • (sagemaker.inputs.TrainingInput) - Channel configuration for S3 data sources that can provide additional information about the training dataset. See sagemaker.inputs.TrainingInput() for full details.

    • (sagemaker.amazon.amazon_estimator.RecordSet) - A collection of Amazon Record objects serialized and stored in S3. For use with an estimator for an Amazon algorithm.

    • (list[sagemaker.amazon.amazon_estimator.RecordSet]) - A list of RecordSet objects, where each instance is a different channel of training data.

    • (dict[str, one of the forms above]) - Required only for tuners created via the factory method HyperparameterTuner.create(). The keys should be the same estimator names used as keys in the estimator_dict argument of the HyperparameterTuner.create() method.

  • job_name (str) – Specify a tuning job name if needed.

  • include_cls_metadata

    It can take one of the following two forms.

    • (bool) - Whether or not the hyperparameter tuning job should include information about the estimator class (default: False). This information is passed as a hyperparameter, so if the algorithm you are using cannot handle unknown hyperparameters (e.g. an Amazon SageMaker built-in algorithm that does not have a custom estimator in the Python SDK), then set include_cls_metadata to False.

    • (dict[str, bool]) - This version should be used for tuners created via the factory method HyperparameterTuner.create(), to specify the flag for the individual estimators provided in the estimator_dict argument of the method. The keys are the same estimator names as in estimator_dict. An estimator that doesn’t need the flag set can be omitted from the dictionary. If none of the estimators need the flag set, an empty dictionary {} must be used.

  • mini_batch_size

    It can take one of the following two forms.

    • (int) - Specify this argument only when estimator is a built-in estimator of an Amazon algorithm. For other estimators, batch size should be specified in the estimator.

    • (dict[str, int]) - This version should be used for tuners created via the factory method HyperparameterTuner.create(), to specify the value for the individual estimators provided in the estimator_dict argument of the method. The keys are the same estimator names as in estimator_dict. An estimator that doesn’t need the value set can be omitted from the dictionary. If none of the estimators need the value set, an empty dictionary {} must be used.

Returns

Tuning config that can be directly used by SageMakerTuningOperator in Airflow.

Return type

dict
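
Example (a hedged sketch; the estimator, objective metric, regex, and S3 paths are placeholders, not part of the reference): exporting a tuning config for a single-estimator HyperparameterTuner.

    from sagemaker.estimator import Estimator
    from sagemaker.tuner import ContinuousParameter, HyperparameterTuner
    from sagemaker.workflow.airflow import tuning_config

    # Placeholder estimator; substitute your own image, role, and output location.
    estimator = Estimator(
        image_uri="<training-image-uri>",
        role="<execution-role-arn>",
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://<bucket>/output",
    )

    tuner = HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:accuracy",
        hyperparameter_ranges={"learning_rate": ContinuousParameter(0.001, 0.1)},
        metric_definitions=[
            {"Name": "validation:accuracy", "Regex": "accuracy = ([0-9\\.]+)"}
        ],
        max_jobs=4,
        max_parallel_jobs=2,
    )

    # Dict consumed by SageMakerTuningOperator via its `config` argument.
    tune_config = tuning_config(tuner=tuner, inputs="s3://<bucket>/train")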

model_config

sagemaker.workflow.airflow.model_config(model, instance_type=None, role=None, image_uri=None)

Export Airflow model config from a SageMaker model

Parameters
  • model (sagemaker.model.Model) – The Model object from which to export the Airflow config

  • instance_type (str) – The EC2 instance type to deploy this Model to. For example, ‘ml.p2.xlarge’

  • role (str) – The ExecutionRoleArn IAM Role ARN for the model

  • image_uri (str) – A Docker image URI to use for deploying the model

Returns

Model config that can be directly used by SageMakerModelOperator in Airflow. It can also be part of the config used by SageMakerEndpointOperator and SageMakerTransformOperator in Airflow.

Return type

dict
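
Example (a minimal sketch; the inference image, model artifact, role, and model name are placeholders): exporting a model config from an existing Model object.

    from sagemaker.model import Model
    from sagemaker.workflow.airflow import model_config

    # Placeholder model; substitute your own inference image, artifact, and role.
    model = Model(
        image_uri="<inference-image-uri>",
        model_data="s3://<bucket>/model/model.tar.gz",
        role="<execution-role-arn>",
        name="my-model",
    )

    # Dict consumed by SageMakerModelOperator, or embedded in endpoint/transform configs.
    config = model_config(model=model, instance_type="ml.m5.xlarge")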

model_config_from_estimator

sagemaker.workflow.airflow.model_config_from_estimator(estimator, task_id, task_type, instance_type=None, role=None, image_uri=None, name=None, model_server_workers=None, vpc_config_override='VPC_CONFIG_DEFAULT')

Export Airflow model config from a SageMaker estimator

Parameters
  • estimator (sagemaker.estimator.EstimatorBase) – The SageMaker estimator to export Airflow config from. It has to be an estimator associated with a training job.

  • task_id (str) – The task id of any airflow.contrib.operators.SageMakerTrainingOperator or airflow.contrib.operators.SageMakerTuningOperator that generates training jobs in the DAG. The model config is built based on the training job generated in this operator.

  • task_type (str) – Whether the task is from SageMakerTrainingOperator or SageMakerTuningOperator. Values can be ‘training’, ‘tuning’ or None (which means training job is not from any task).

  • instance_type (str) – The EC2 instance type to deploy this Model to. For example, ‘ml.p2.xlarge’

  • role (str) – The ExecutionRoleArn IAM Role ARN for the model

  • image_uri (str) – A Docker image URI to use for deploying the model

  • name (str) – Name of the model

  • model_server_workers (int) – The number of worker processes used by the inference server. If None, server will use one worker per vCPU. Only effective when estimator is a SageMaker framework.

  • vpc_config_override (dict[str, list[str]]) –

    Override for VpcConfig set on the model. Default: use subnets and security groups from this Estimator.

    • ‘Subnets’ (list[str]): List of subnet ids.

    • ‘SecurityGroupIds’ (list[str]): List of security group ids.

Returns

Model config that can be directly used by SageMakerModelOperator in Airflow. It can also be part of the config used by SageMakerEndpointOperator in Airflow.

Return type

dict
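
Example (a sketch that reuses the placeholder `estimator` object and the task_id "train" from the training_config example above; both are assumptions, not part of the reference):

    from sagemaker.workflow.airflow import model_config_from_estimator

    # `estimator` is the same object passed to training_config(), and a
    # SageMakerTrainingOperator with task_id="train" runs earlier in the DAG.
    model_cfg = model_config_from_estimator(
        estimator=estimator,
        task_id="train",
        task_type="training",
        instance_type="ml.m5.xlarge",
        name="my-model",
    )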

transform_config

sagemaker.workflow.airflow.transform_config(transformer, data, data_type='S3Prefix', content_type=None, compression_type=None, split_type=None, job_name=None, input_filter=None, output_filter=None, join_source=None)

Export Airflow transform config from a SageMaker transformer

Parameters
  • transformer (sagemaker.transformer.Transformer) – The SageMaker transformer to export Airflow config from.

  • data (str) – Input data location in S3.

  • data_type (str) –

    What the S3 location defines (default: ‘S3Prefix’). Valid values:

    • ‘S3Prefix’ - the S3 URI defines a key name prefix. All objects with this prefix will be used as inputs for the transform job.

    • ‘ManifestFile’ - the S3 URI points to a single manifest file listing each S3 object to use as an input for the transform job.

  • content_type (str) – MIME type of the input data (default: None).

  • compression_type (str) – Compression type of the input data, if compressed (default: None). Valid values: ‘Gzip’, None.

  • split_type (str) – The record delimiter for the input object (default: ‘None’). Valid values: ‘None’, ‘Line’, ‘RecordIO’, and ‘TFRecord’.

  • job_name (str) – Transform job name (default: None). If not specified, one will be generated.

  • input_filter (str) – A JSONPath to select a portion of the input to pass to the algorithm container for inference. If you omit the field, it gets the value ‘$’, representing the entire input. For CSV data, each row is taken as a JSON array, so only index-based JSONPaths can be applied, e.g. $[0], $[1:]. CSV data should follow the RFC format. See Supported JSONPath Operators for a table of supported JSONPath operators. For more information, see the SageMaker API documentation for CreateTransformJob. Some examples: “$[1:]”, “$.features” (default: None).

  • output_filter (str) –

    A JSONPath to select a portion of the joined/original output to return as the output. For more information, see the SageMaker API documentation for CreateTransformJob. Some examples: “$[1:]”, “$.prediction” (default: None).

  • join_source (str) – The source of data to be joined to the transform output. It can be set to ‘Input’ meaning the entire input record will be joined to the inference result. You can use OutputFilter to select the useful portion before uploading to S3. (default: None). Valid values: Input, None.

Returns

Transform config that can be directly used by SageMakerTransformOperator in Airflow.

Return type

dict
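
Example (a minimal sketch; the model name, S3 paths, and instance type are placeholders): exporting a transform config from a Transformer that points at an existing SageMaker model.

    from sagemaker.transformer import Transformer
    from sagemaker.workflow.airflow import transform_config

    # Placeholder transformer; model_name must refer to an existing SageMaker model.
    transformer = Transformer(
        model_name="my-model",
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://<bucket>/transform-output",
    )

    # Dict consumed by SageMakerTransformOperator via its `config` argument.
    xform_config = transform_config(
        transformer=transformer,
        data="s3://<bucket>/transform-input",
        content_type="text/csv",
        split_type="Line",
    )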

transform_config_from_estimator

sagemaker.workflow.airflow.transform_config_from_estimator(estimator, task_id, task_type, instance_count, instance_type, data, data_type='S3Prefix', content_type=None, compression_type=None, split_type=None, job_name=None, model_name=None, strategy=None, assemble_with=None, output_path=None, output_kms_key=None, accept=None, env=None, max_concurrent_transforms=None, max_payload=None, tags=None, role=None, volume_kms_key=None, model_server_workers=None, image_uri=None, vpc_config_override=None, input_filter=None, output_filter=None, join_source=None)

Export Airflow transform config from a SageMaker estimator

Parameters
  • estimator (sagemaker.estimator.EstimatorBase) – The SageMaker estimator to export Airflow config from. It has to be an estimator associated with a training job.

  • task_id (str) – The task id of any airflow.contrib.operators.SageMakerTrainingOperator or airflow.contrib.operators.SageMakerTuningOperator that generates training jobs in the DAG. The transform config is built based on the training job generated in this operator.

  • task_type (str) – Whether the task is from SageMakerTrainingOperator or SageMakerTuningOperator. Values can be ‘training’, ‘tuning’ or None (which means training job is not from any task).

  • instance_count (int) – Number of EC2 instances to use.

  • instance_type (str) – Type of EC2 instance to use, for example, ‘ml.c4.xlarge’.

  • data (str) – Input data location in S3.

  • data_type (str) –

    What the S3 location defines (default: ‘S3Prefix’). Valid values:

    • ‘S3Prefix’ - the S3 URI defines a key name prefix. All objects with this prefix will be used as inputs for the transform job.

    • ‘ManifestFile’ - the S3 URI points to a single manifest file listing each S3 object to use as an input for the transform job.

  • content_type (str) – MIME type of the input data (default: None).

  • compression_type (str) – Compression type of the input data, if compressed (default: None). Valid values: ‘Gzip’, None.

  • split_type (str) – The record delimiter for the input object (default: ‘None’). Valid values: ‘None’, ‘Line’, ‘RecordIO’, and ‘TFRecord’.

  • job_name (str) – transform job name (default: None). If not specified, one will be generated.

  • model_name (str) – Model name (default: None). If not specified, one will be generated.

  • strategy (str) – The strategy used to decide how to batch records in a single request (default: None). Valid values: ‘MultiRecord’ and ‘SingleRecord’.

  • assemble_with (str) – How the output is assembled (default: None). Valid values: ‘Line’ or ‘None’.

  • output_path (str) – S3 location for saving the transform result. If not specified, results are stored to a default bucket.

  • output_kms_key (str) – Optional. KMS key ID for encrypting the transform output (default: None).

  • accept (str) – The accept header passed by the client to the inference endpoint. If it is supported by the endpoint, it will be the format of the batch transform output.

  • env (dict) – Environment variables to be set for use during the transform job (default: None).

  • max_concurrent_transforms (int) – The maximum number of HTTP requests to be made to each individual transform container at one time.

  • max_payload (int) – Maximum size of the payload in a single HTTP request to the container in MB.

  • tags (list[dict]) – List of tags for labeling a transform job. If none specified, then the tags used for the training job are used for the transform job.

  • role (str) – The ExecutionRoleArn IAM Role ARN for the Model, which is also used during transform jobs. If not specified, the role from the Estimator will be used.

  • volume_kms_key (str) – Optional. KMS key ID for encrypting the volume attached to the ML compute instance (default: None).

  • model_server_workers (int) – Optional. The number of worker processes used by the inference server. If None, server will use one worker per vCPU.

  • image_uri (str) – A Docker image URI to use for deploying the model

  • vpc_config_override (dict[str, list[str]]) –

    Override for VpcConfig set on the model. Default: use subnets and security groups from this Estimator.

    • ‘Subnets’ (list[str]): List of subnet ids.

    • ‘SecurityGroupIds’ (list[str]): List of security group ids.

  • input_filter (str) –

    A JSONPath to select a portion of the input to pass to the algorithm container for inference. If you omit the field, it gets the value ‘$’, representing the entire input. For CSV data, each row is taken as a JSON array, so only index-based JSONPaths can be applied, e.g. $[0], $[1:]. CSV data should follow the RFC format. See Supported JSONPath Operators for a table of supported JSONPath operators. For more information, see the SageMaker API documentation for CreateTransformJob. Some examples: “$[1:]”, “$.features” (default: None).

  • output_filter (str) –

    A JSONPath to select a portion of the joined/original output to return as the output. For more information, see the SageMaker API documentation for CreateTransformJob. Some examples: “$[1:]”, “$.prediction” (default: None).

  • join_source (str) – The source of data to be joined to the transform output. It can be set to ‘Input’ meaning the entire input record will be joined to the inference result. You can use OutputFilter to select the useful portion before uploading to S3. (default: None). Valid values: Input, None.

Returns

Transform config that can be directly used by SageMakerTransformOperator in Airflow.

Return type

dict
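
Example (a sketch that reuses the placeholder `estimator` and the upstream task_id "train" from the earlier examples; all names and paths are illustrative):

    from sagemaker.workflow.airflow import transform_config_from_estimator

    xform_cfg = transform_config_from_estimator(
        estimator=estimator,           # estimator used by the upstream training task
        task_id="train",               # task id of the SageMakerTrainingOperator in the DAG
        task_type="training",
        instance_count=1,
        instance_type="ml.m5.xlarge",
        data="s3://<bucket>/transform-input",
        content_type="text/csv",
        split_type="Line",
        model_name="my-model",
    )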

deploy_config

sagemaker.workflow.airflow.deploy_config(model, initial_instance_count, instance_type, endpoint_name=None, tags=None)

Export Airflow deploy config from a SageMaker model

Parameters
  • model (sagemaker.model.Model) – The SageMaker model to export the Airflow config from.

  • initial_instance_count (int) – The initial number of instances to run in the Endpoint created from this Model.

  • instance_type (str) – The EC2 instance type to deploy this Model to. For example, ‘ml.p2.xlarge’.

  • endpoint_name (str) – The name of the endpoint to create (default: None). If not specified, a unique endpoint name will be created.

  • tags (list[dict]) – List of tags for labeling a training job. For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.

Returns

Deploy config that can be directly used by SageMakerEndpointOperator in Airflow.

Return type

dict
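
Example (a minimal sketch that reuses the placeholder `model` from the model_config example; the endpoint name is illustrative, and the operator import path depends on your Airflow version):

    from sagemaker.workflow.airflow import deploy_config

    endpoint_cfg = deploy_config(
        model=model,
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",
        endpoint_name="my-endpoint",
    )

    # In the DAG definition (import path may differ by Airflow version):
    # from airflow.providers.amazon.aws.operators.sagemaker import SageMakerEndpointOperator
    # deploy_op = SageMakerEndpointOperator(task_id="deploy", config=endpoint_cfg)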

deploy_config_from_estimator

sagemaker.workflow.airflow.deploy_config_from_estimator(estimator, task_id, task_type, initial_instance_count, instance_type, model_name=None, endpoint_name=None, tags=None, **kwargs)

Export Airflow deploy config from a SageMaker estimator

Parameters
  • estimator (sagemaker.estimator.EstimatorBase) – The SageMaker estimator to export Airflow config from. It has to be an estimator associated with a training job.

  • task_id (str) – The task id of any airflow.contrib.operators.SageMakerTrainingOperator or airflow.contrib.operators.SageMakerTuningOperator that generates training jobs in the DAG. The endpoint config is built based on the training job generated in this operator.

  • task_type (str) – Whether the task is from SageMakerTrainingOperator or SageMakerTuningOperator. Values can be ‘training’, ‘tuning’ or None (which means training job is not from any task).

  • initial_instance_count (int) – Minimum number of EC2 instances to deploy to an endpoint for prediction.

  • instance_type (str) – Type of EC2 instance to deploy to an endpoint for prediction, for example, ‘ml.c4.xlarge’.

  • model_name (str) – Name to use for creating an Amazon SageMaker model. If not specified, one will be generated.

  • endpoint_name (str) – Name to use for creating an Amazon SageMaker endpoint. If not specified, the name of the SageMaker model is used.

  • tags (list[dict]) – List of tags for labeling a training job. For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.

  • **kwargs – Passed to invocation of create_model(). Implementations may customize create_model() to accept **kwargs to customize model creation during deploy. For more, see the implementation docs.

Returns

Deploy config that can be directly used by SageMakerEndpointOperator in Airflow.

Return type

dict
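
Example (a sketch that reuses the placeholder `estimator` and the upstream task_id "train" from the earlier examples; model and endpoint names are illustrative):

    from sagemaker.workflow.airflow import deploy_config_from_estimator

    endpoint_cfg = deploy_config_from_estimator(
        estimator=estimator,
        task_id="train",
        task_type="training",
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",
        model_name="my-model",
        endpoint_name="my-endpoint",
    )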