Debugger

Amazon SageMaker Debugger provides full visibility into training jobs of state-of-the-art machine learning models. This module provides high-level methods to set up Debugger objects, such as built-in rules, tensor collections, and hook configurations. Pass these objects as parameters when constructing a SageMaker estimator to initiate a training job.

sagemaker.debugger.get_rule_container_image_uri(region)

Returns the Debugger rule image URI for the given AWS region. For a full list of rule image URIs, see Use Debugger Docker Images for Built-in or Custom Rules.

Parameters

region (str) – The name of the AWS Region. For example, 'us-east-1'.

Returns

Formatted image URI for the given region and the rule container type.

Return type

str

class sagemaker.debugger.Rule(name, image_uri, instance_type, container_local_output_path, s3_output_path, volume_size_in_gb, rule_parameters, collections_to_save)

Bases: object

Debugger rules analyze tensors emitted while training jobs are running. The rules monitor conditions that are critical for success of your training job.

Use the Rule.sagemaker class method for built-in rules or the Rule.custom class method for custom rules. Do not call the Rule initialization method directly.

classmethod sagemaker(base_config, name=None, container_local_output_path=None, s3_output_path=None, other_trials_s3_input_paths=None, rule_parameters=None, collections_to_save=None)

Initialize a Rule processing job for a built-in SageMaker Debugging Rule. The built-in rule analyzes tensors emitted during the training of a model and monitors conditions that are critical for the success of the training job.

Parameters
  • base_config (dict) – Required. This is the base rule config dictionary returned from the rule_configs method. For example, rule_configs.dead_relu().

  • name (str) – Optional. The name of the debugger rule. If one is not provided, the name of the base_config will be used.

  • container_local_output_path (str) – Optional. The local path in the rule processing container.

  • s3_output_path (str) – Optional. The location in S3 to store the output tensors. The default Debugger output path is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-111122223333/<training-job-name>/debug-output/.

  • other_trials_s3_input_paths ([str]) – Optional. S3 input paths for other trials.

  • rule_parameters (dict) – Optional. A dictionary of parameters for the rule.

  • collections_to_save ([sagemaker.debugger.CollectionConfig]) – Optional. A list of CollectionConfig objects to be saved.

Returns

The instance of the built-in rule.

Return type

sagemaker.debugger.Rule

Example of creating a built-in rule instance:

from sagemaker.debugger import Rule, rule_configs

built_in_rules = [
    Rule.sagemaker(rule_configs.built_in_rule_name_in_pysdk_format_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_in_pysdk_format_2()),
    ...
    Rule.sagemaker(rule_configs.built_in_rule_name_in_pysdk_format_n())
]

Replace each built_in_rule_name_in_pysdk_format_* placeholder with the name of a built-in rule. You can find the rule names at List of Debugger Built-in Rules.

Example of creating a built-in rule instance with adjusted parameter values:

from sagemaker.debugger import CollectionConfig, Rule, rule_configs

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name_in_pysdk_format(),
        rule_parameters={
            "key": "value"
        },
        collections_to_save=[
            CollectionConfig(
                name="tensor_collection_name",
                parameters={
                    "key": "value"
                }
            )
        ]
    )
]

For more information about setting up the rule_parameters parameter, see List of Debugger Built-in Rules.

For more information about setting up the collections_to_save parameter, see the CollectionConfig class.
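As a rough illustration of the default s3_output_path pattern described in the parameter list above, the following sketch formats such a path by hand. The helper name default_debug_output_path and its arguments are hypothetical and not part of the SDK, which derives this location itself from the Estimator output path.

```python
def default_debug_output_path(region, account_id, training_job_name):
    """Format a path that follows the documented default pattern (illustrative only)."""
    # Hypothetical helper: the bucket name mirrors the example in the parameter list.
    bucket = f"sagemaker-{region}-{account_id}"
    return f"s3://{bucket}/{training_job_name}/debug-output/"


print(default_debug_output_path("us-east-1", "111122223333", "my-training-job"))
```

Setting s3_output_path explicitly is only needed when the rule output should be stored somewhere other than this default location.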

classmethod custom(name, image_uri, instance_type, volume_size_in_gb, source=None, rule_to_invoke=None, container_local_output_path=None, s3_output_path=None, other_trials_s3_input_paths=None, rule_parameters=None, collections_to_save=None)

Initialize a Rule processing job for a custom SageMaker Debugging Rule. The custom rule analyzes tensors emitted during the training of a model and monitors conditions that are critical for the success of a training job. For more information, see Create Debugger Custom Rules for Training Job Analysis.

Parameters
  • name (str) – Required. The name of the debugger rule.

  • image_uri (str) – Required. The URI of the image to be used by the debugger rule.

  • instance_type (str) – Required. Type of EC2 instance to use. For example, 'ml.c4.xlarge'.

  • volume_size_in_gb (int) – Required. Size in GB of the EBS volume to use for storing data.

  • source (str) – Optional. A source file containing a rule to invoke. If provided, you must also provide rule_to_invoke. This can be either an S3 URI or a local path.

  • rule_to_invoke (str) – Optional. The name of the rule to invoke within the source. If provided, you must also provide source.

  • container_local_output_path (str) – Optional. The local path in the container.

  • s3_output_path (str) – Optional. The location in S3 to store the output tensors. The default Debugger output path is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-111122223333/<training-job-name>/debug-output/.

  • other_trials_s3_input_paths ([str]) – Optional. S3 input paths for other trials.

  • rule_parameters (dict) – Optional. A dictionary of parameters for the rule.

  • collections_to_save ([sagemaker.debugger.CollectionConfig]) – Optional. A list of CollectionConfig objects to be saved.

Returns

The instance of the custom Rule.

Return type

sagemaker.debugger.Rule
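The source and rule_to_invoke parameters above must be supplied together. A minimal sketch of a pre-check you could run before calling Rule.custom is shown below. The function check_custom_rule_source is hypothetical and is not the SDK's own validation logic.

```python
def check_custom_rule_source(source=None, rule_to_invoke=None):
    """Return True when source and rule_to_invoke are both set or both omitted."""
    # Exactly one of the two being None means the pairing requirement is violated.
    if (source is None) != (rule_to_invoke is None):
        raise ValueError("source and rule_to_invoke must be provided together")
    return True


# Both omitted (rule is baked into the image) or both given: accepted.
check_custom_rule_source()
check_custom_rule_source(source="rules/my_rule.py", rule_to_invoke="MyRule")
```

The path and rule name in the usage lines are placeholders chosen for illustration.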

to_debugger_rule_config_dict()

Generates a request dictionary using the parameters provided when initializing the object.

Returns

A portion of an API request as a dictionary.

Return type

dict
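To give a sense of what such a request portion might look like, here is a hedged sketch that assembles a dictionary from rule settings and omits any that are unset. The key names are an assumption based on the DebugRuleConfiguration structure of the CreateTrainingJob API, and the function sketch_rule_config_dict is hypothetical. The actual output of to_debugger_rule_config_dict may differ.

```python
def sketch_rule_config_dict(name, image_uri, instance_type=None,
                            volume_size_in_gb=None, rule_parameters=None):
    """Assemble a dict shaped like a rule configuration entry, skipping unset fields."""
    candidate = {
        "RuleConfigurationName": name,       # assumed API field name
        "RuleEvaluatorImage": image_uri,     # assumed API field name
        "InstanceType": instance_type,
        "VolumeSizeInGB": volume_size_in_gb,
        "RuleParameters": rule_parameters,
    }
    # Keep only the settings that were actually provided.
    return {key: value for key, value in candidate.items() if value is not None}


config = sketch_rule_config_dict(
    name="my-custom-rule",
    image_uri="<rule-image-uri>",
    instance_type="ml.c4.xlarge",
    volume_size_in_gb=10,
)
print(sorted(config))
```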

class sagemaker.debugger.DebuggerHookConfig(s3_output_path=None, container_local_output_path=None, hook_parameters=None, collection_configs=None)

Bases: object

Initialize an instance of DebuggerHookConfig. DebuggerHookConfig provides options to customize how debugging information is emitted and saved. This high-level DebuggerHookConfig class is based on the smdebug.SaveConfig class.

Parameters
  • s3_output_path (str) – Optional. The location in S3 to store the output tensors. The default Debugger output path is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-111122223333/<training-job-name>/debug-output/.

  • container_local_output_path (str) – Optional. The local path in the container.

  • hook_parameters (dict) – Optional. A dictionary of parameters.

  • collection_configs ([sagemaker.debugger.CollectionConfig]) – Required. A list of CollectionConfig objects to be saved at the s3_output_path.

Example of creating a DebuggerHookConfig object:

from sagemaker.debugger import CollectionConfig, DebuggerHookConfig

collection_configs=[
    CollectionConfig(name="tensor_collection_1"),
    CollectionConfig(name="tensor_collection_2"),
    # ...
    CollectionConfig(name="tensor_collection_n")
]

hook_config = DebuggerHookConfig(
    collection_configs=collection_configs
)

class sagemaker.debugger.TensorBoardOutputConfig(s3_output_path, container_local_output_path=None)

Bases: object

A TensorBoard output configuration object that provides options to customize debugging visualizations using TensorBoard.

Parameters
  • s3_output_path (str) – Optional. The location in S3 to store the output.

  • container_local_output_path (str) – Optional. The local path in the container.

class sagemaker.debugger.CollectionConfig(name, parameters=None)

Bases: object

Creates tensor collections for SageMaker Debugger.

Constructor for collection configuration.

Parameters
  • name (str) – Required. The name of the collection configuration.

  • parameters (dict) – Optional. The parameters for the collection configuration.

Example of creating a CollectionConfig object:

from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(name="tensor_collection_1"),
    CollectionConfig(name="tensor_collection_2"),
    # ...
    CollectionConfig(name="tensor_collection_n")
]

For a full list of Debugger built-in collections, see Debugger Built in Collections.

Example of creating a CollectionConfig object with parameter adjustment:

You can use the following CollectionConfig template in two ways: (1) to adjust the parameters of the built-in tensor collections, and (2) to create custom tensor collections.

If you pass a built-in collection name to the name parameter, CollectionConfig matches the built-in collection and adjusts its parameters. If you pass a new name, CollectionConfig creates a new tensor collection, and you must use the include_regex parameter to specify a regex for the tensors you want to collect.

from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(
        name="tensor_collection",
        parameters={
            "key_1": "value_1",
            "key_2": "value_2",
            # ...
            "key_n": "value_n"
        }
    )
]

The following list shows the available CollectionConfig parameters.

Parameter Key – Description

  • include_regex – Specify a list of regex patterns of tensors to save. Tensors whose names match these patterns will be saved.

  • save_histogram – Set to True to save histogram output data for TensorFlow visualization.

  • reductions – Specify certain reduction values of tensors. This helps reduce the amount of data saved and increase training speed. Available values are min, max, median, mean, std, variance, sum, and prod.

  • save_interval, train.save_interval, eval.save_interval, predict.save_interval, global.save_interval – Specify how often to save tensors, in steps. You can also specify the save intervals in TRAIN, EVAL, PREDICT, and GLOBAL modes. The default value is 500 steps.

  • save_steps, train.save_steps, eval.save_steps, predict.save_steps, global.save_steps – Specify the exact step numbers at which to save tensors. You can also specify the save steps in TRAIN, EVAL, PREDICT, and GLOBAL modes.

  • start_step, train.start_step, eval.start_step, predict.start_step, global.start_step – Specify the exact step at which to start saving tensors. You can also specify the start steps in TRAIN, EVAL, PREDICT, and GLOBAL modes.

  • end_step, train.end_step, eval.end_step, predict.end_step, global.end_step – Specify the exact step at which to stop saving tensors. You can also specify the end steps in TRAIN, EVAL, PREDICT, and GLOBAL modes.
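To make the include_regex behavior described above more concrete, the following sketch filters a list of tensor names against a list of patterns using Python's re module. The tensor names and the select_tensors helper are hypothetical, and the use of re.search (a match anywhere in the name) is an assumption about how Debugger applies the patterns.

```python
import re


def select_tensors(tensor_names, include_patterns):
    """Return the names matched by at least one pattern (re.search semantics assumed)."""
    compiled = [re.compile(pattern) for pattern in include_patterns]
    return [
        name for name in tensor_names
        if any(regex.search(name) for regex in compiled)
    ]


# Hypothetical tensor names, for illustration only.
names = ["conv1/relu_output", "conv1/weight", "fc/bias", "fc/relu_output"]
print(select_tensors(names, [r"relu", r"bias$"]))
# prints ['conv1/relu_output', 'fc/bias', 'fc/relu_output']
```

Testing patterns locally in this way can help catch a regex that is broader or narrower than intended before a training job starts.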

For example, the following code shows how to control the save interval parameters of the built-in losses tensor collection.

from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(
        name="losses",
        parameters={
            "train.save_interval": "100",
            "eval.save_interval": "10"
        }
    )
]
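The mode-prefixed keys used in the losses example above can also be generated programmatically. The following is a minimal sketch, assuming the four mode prefixes listed in the parameter descriptions. The helper per_mode_parameters is hypothetical and not part of the SDK. Values are converted to strings to mirror the example above.

```python
VALID_MODES = ("train", "eval", "predict", "global")


def per_mode_parameters(key, values_by_mode):
    """Build a parameters dict with mode-prefixed keys such as 'train.save_interval'."""
    parameters = {}
    for mode, value in values_by_mode.items():
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        # Store values as strings, mirroring the losses example.
        parameters[f"{mode}.{key}"] = str(value)
    return parameters


print(per_mode_parameters("save_interval", {"train": 100, "eval": 10}))
# prints {'train.save_interval': '100', 'eval.save_interval': '10'}
```

The resulting dictionary can be passed as the parameters argument of CollectionConfig.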