Debugger
Amazon SageMaker Debugger provides full visibility into training jobs of state-of-the-art machine learning models. This SageMaker Debugger module provides high-level methods to set up Debugger configurations to monitor, profile, and debug your training job. Configure the Debugger-specific parameters when constructing a SageMaker estimator to gain visibility and insights into your training job.
Debugger Rule APIs
- sagemaker.debugger.get_rule_container_image_uri(name, region)
Return the Debugger rule image URI for the given AWS Region.
For a full list of rule image URIs, see Use Debugger Docker Images for Built-in or Custom Rules.
- sagemaker.debugger.get_default_profiler_processing_job(instance_type=None, volume_size_in_gb=None)
Return the default profiler processing job (a rule) with a unique name.
- Returns
The instance of the built-in ProfilerRule.
- Return type
- class sagemaker.debugger.rule_configs
A helper module to configure the SageMaker Debugger built-in rules with the Rule classmethods and the ProfilerRule classmethods.
For a full list of built-in rules, see List of Debugger Built-in Rules.
This module is imported from the Debugger client library for rule configuration. For more information, see Amazon SageMaker Debugger RulesConfig.
- class sagemaker.debugger.RuleBase(name, image_uri, instance_type, container_local_output_path, s3_output_path, volume_size_in_gb, rule_parameters)
Bases: ABC
The SageMaker Debugger rule base class that cannot be instantiated directly.
Tip
Debugger rule classes inheriting this RuleBase class are Rule and ProfilerRule. Do not directly use the rule base class to instantiate a SageMaker Debugger rule. Use the Rule classmethods for debugging and the ProfilerRule classmethods for profiling.
Method generated by attrs for class RuleBase.
- class sagemaker.debugger.Rule(name, image_uri, instance_type, container_local_output_path, s3_output_path, volume_size_in_gb, rule_parameters, collections_to_save, actions=None)
Bases: RuleBase
The SageMaker Debugger Rule class configures debugging rules to debug your training job.
The debugging rules analyze tensor outputs from your training job and monitor conditions that are critical for the success of the training job.
SageMaker Debugger comes pre-packaged with built-in debugging rules. For example, the debugging rules can detect whether gradients are getting too large or too small, or if a model is overfitting. For a full list of built-in rules for debugging, see List of Debugger Built-in Rules. You can also write your own rules using the custom rule classmethod.
Configure the debugging rules using the following classmethods.
Tip
Use the following Rule.sagemaker class method for built-in debugging rules or the Rule.custom class method for custom debugging rules. Do not directly use the Rule initialization method.
- classmethod sagemaker(base_config, name=None, container_local_output_path=None, s3_output_path=None, other_trials_s3_input_paths=None, rule_parameters=None, collections_to_save=None, actions=None)
Initialize a Rule object for a built-in debugging rule.
- Parameters
base_config (dict) – Required. The base rule config dictionary returned from a rule_configs method. For example, rule_configs.dead_relu(). For a full list of built-in rules for debugging, see List of Debugger Built-in Rules.
name (str) – Optional. The name of the debugger rule. If one is not provided, the name from base_config is used.
container_local_output_path (str) – Optional. The local path in the rule processing container.
s3_output_path (str) – Optional. The location in Amazon S3 to store the output tensors. The default Debugger output path for debugging data is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/debug-output/.
other_trials_s3_input_paths ([str]) – Optional. The Amazon S3 input paths of other trials to use with the SimilarAcrossRuns rule.
rule_parameters (dict) – Optional. A dictionary of parameters for the rule.
collections_to_save (CollectionConfig) – Optional. A list of CollectionConfig objects to be saved.
- Returns
An instance of the built-in rule.
- Return type
Example of how to create a built-in rule instance:
    from sagemaker.debugger import Rule, rule_configs

    built_in_rules = [
        Rule.sagemaker(rule_configs.built_in_rule_name_in_pysdk_format_1()),
        Rule.sagemaker(rule_configs.built_in_rule_name_in_pysdk_format_2()),
        # ...
        Rule.sagemaker(rule_configs.built_in_rule_name_in_pysdk_format_n())
    ]
Replace built_in_rule_name_in_pysdk_format_* with the names of built-in rules. You can find the rule names at List of Debugger Built-in Rules.
Example of creating a built-in rule instance with adjusted parameter values:
    from sagemaker.debugger import CollectionConfig, Rule, rule_configs

    built_in_rules = [
        Rule.sagemaker(
            base_config=rule_configs.built_in_rule_name_in_pysdk_format(),
            rule_parameters={
                "key": "value"
            },
            collections_to_save=[
                CollectionConfig(
                    name="tensor_collection_name",
                    parameters={
                        "key": "value"
                    }
                )
            ]
        )
    ]
For more information about setting up the rule_parameters parameter, see List of Debugger Built-in Rules.
For more information about setting up the collections_to_save parameter, see the CollectionConfig class.
- classmethod custom(name, image_uri, instance_type, volume_size_in_gb, source=None, rule_to_invoke=None, container_local_output_path=None, s3_output_path=None, other_trials_s3_input_paths=None, rule_parameters=None, collections_to_save=None, actions=None)
Initialize a Rule object for a custom debugging rule.
You can create a custom rule that analyzes tensors emitted during the training of a model and monitors conditions that are critical for the success of a training job. For more information, see Create Debugger Custom Rules for Training Job Analysis.
- Parameters
name (str) – Required. The name of the debugger rule.
image_uri (str or PipelineVariable) – Required. The URI of the image to be used by the debugger rule.
instance_type (str or PipelineVariable) – Required. Type of EC2 instance to use, for example, ‘ml.c4.xlarge’.
volume_size_in_gb (int or PipelineVariable) – Required. Size in GB of the EBS volume to use for storing data.
source (str) – Optional. A source file containing a rule to invoke. If provided, you must also provide rule_to_invoke. This can either be an S3 uri or a local path.
rule_to_invoke (str or PipelineVariable) – Optional. The name of the rule to invoke within the source. If provided, you must also provide source.
container_local_output_path (str or PipelineVariable) – Optional. The local path in the container.
s3_output_path (str or PipelineVariable) – Optional. The location in Amazon S3 to store the output tensors. The default Debugger output path for debugging data is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/debug-output/.
other_trials_s3_input_paths (list[str] or list[PipelineVariable]) – Optional. The Amazon S3 input paths of other trials to use with the SimilarAcrossRuns rule.
rule_parameters (dict[str, str] or dict[str, PipelineVariable]) – Optional. A dictionary of parameters for the rule.
collections_to_save ([sagemaker.debugger.CollectionConfig]) – Optional. A list of CollectionConfig objects to be saved.
- Returns
The instance of the custom rule.
- Return type
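The parameter descriptions for Rule.custom state that source and rule_to_invoke must be given together. As a minimal sketch of that constraint, the hypothetical helper below (check_custom_rule_source is not part of the SDK) can be used to validate the pair before calling Rule.custom. The file path and rule name in the usage lines are placeholders.

```python
def check_custom_rule_source(source=None, rule_to_invoke=None):
    """Validate the source / rule_to_invoke pairing described above.

    Hypothetical helper, not an SDK function. Returns True when both
    values are given, False when both are omitted, and raises
    ValueError when only one of them is set.
    """
    if (source is None) != (rule_to_invoke is None):
        raise ValueError("source and rule_to_invoke must be provided together.")
    return source is not None


# Neither argument: the rule is expected to come from the image itself.
print(check_custom_rule_source())  # False

# Both arguments: a rule class name inside a source file (placeholder values).
print(check_custom_rule_source("rules/my_rule.py", "MyCustomRule"))  # True
```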
Debugger Configuration APIs
- class sagemaker.debugger.CollectionConfig(name, parameters=None)
Bases: object
Creates tensor collections for SageMaker Debugger.
Constructor for collection configuration.
- Parameters
name (str or PipelineVariable) – Required. The name of the collection configuration.
parameters (dict[str, str] or dict[str, PipelineVariable]) – Optional. The parameters for the collection configuration.
Example of creating a CollectionConfig object:
    from sagemaker.debugger import CollectionConfig

    collection_configs = [
        CollectionConfig(name="tensor_collection_1"),
        CollectionConfig(name="tensor_collection_2"),
        # ...
        CollectionConfig(name="tensor_collection_n")
    ]
For a full list of Debugger built-in collections, see Debugger Built in Collections.
Example of creating a CollectionConfig object with parameter adjustment:
You can use the following CollectionConfig template in two ways: (1) to adjust the parameters of the built-in tensor collections, and (2) to create custom tensor collections.
If you pass a built-in collection name to the name parameter, CollectionConfig matches it to the built-in collection and adjusts its parameters. If you pass a new name to the name parameter, CollectionConfig creates a new tensor collection, and you must use the include_regex parameter to specify the regex of the tensors you want to collect.
    from sagemaker.debugger import CollectionConfig

    collection_configs = [
        CollectionConfig(
            name="tensor_collection",
            parameters={
                "key_1": "value_1",
                "key_2": "value_2",
                # ...
                "key_n": "value_n"
            }
        )
    ]
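For a custom collection, include_regex decides which tensors are collected. The sketch below illustrates the idea with the standard re module only. It assumes that the patterns are passed as one comma-separated string and that a tensor is selected when any pattern is found in its name. Both points are assumptions for illustration, and select_tensors and the tensor names are hypothetical.

```python
import re


def select_tensors(tensor_names, include_regex):
    """Return the names matched by any pattern in a comma-separated string.

    Illustrative only: splitting on commas is an assumption, so patterns
    that themselves contain commas (such as "a{1,3}") are not supported.
    """
    patterns = [re.compile(p.strip()) for p in include_regex.split(",") if p.strip()]
    return [name for name in tensor_names if any(p.search(name) for p in patterns)]


# Hypothetical tensor names emitted by a training job.
names = ["conv1/weight", "conv1/bias", "fc/weight", "loss"]

# Parameter values are strings, as in the CollectionConfig template.
custom_parameters = {"include_regex": ".*weight,^loss$"}

selected = select_tensors(names, custom_parameters["include_regex"])
print(selected)  # ['conv1/weight', 'fc/weight', 'loss']
```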
The following list shows the available CollectionConfig parameters.
Parameter Key – Description
include_regex – Specify a list of regex patterns of tensors to save. Tensors whose names match these patterns will be saved.
save_histogram – Set to True to save histogram output data for TensorFlow visualization.
reductions – Specify certain reduction values of tensors. This helps reduce the amount of data saved and increase training speed. Available values are min, max, median, mean, std, variance, sum, and prod.
save_interval, train.save_interval, eval.save_interval, predict.save_interval, global.save_interval – Specify how often to save tensors in steps. You can also specify the save intervals in TRAIN, EVAL, PREDICT, and GLOBAL modes. The default value is 500 steps.
save_steps, train.save_steps, eval.save_steps, predict.save_steps, global.save_steps – Specify the exact step numbers to save tensors. You can also specify the save steps in TRAIN, EVAL, PREDICT, and GLOBAL modes.
start_step, train.start_step, eval.start_step, predict.start_step, global.start_step – Specify the exact start step to save tensors. You can also specify the start steps in TRAIN, EVAL, PREDICT, and GLOBAL modes.
end_step, train.end_step, eval.end_step, predict.end_step, global.end_step – Specify the exact end step to save tensors. You can also specify the end steps in TRAIN, EVAL, PREDICT, and GLOBAL modes.
For example, the following code shows how to control the save_interval parameters of the built-in losses tensor collection. With the following collection configuration, Debugger collects loss values every 100 steps from training loops and every 10 steps from evaluation loops.
    from sagemaker.debugger import CollectionConfig

    collection_configs = [
        CollectionConfig(
            name="losses",
            parameters={
                "train.save_interval": "100",
                "eval.save_interval": "10"
            }
        )
    ]
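To get a feel for how save_interval, save_steps, start_step, and end_step interact, the following sketch enumerates the steps that would be saved in one mode. It is a simplified model, not Debugger's implementation, and it assumes that steps are numbered from 0, that a step is saved when it is a multiple of save_interval inside the start and end window, that end_step is exclusive, and that steps listed in save_steps are saved in addition. The function name saved_steps is hypothetical.

```python
def saved_steps(total_steps, save_interval=500, start_step=0, end_step=None, save_steps=()):
    """List the step numbers saved in one mode under the stated assumptions.

    save_interval defaults to 500, the documented default. end_step is
    treated as exclusive, which is an assumption of this sketch.
    """
    explicit = set(save_steps)
    stop = total_steps if end_step is None else min(end_step, total_steps)
    return [
        step
        for step in range(total_steps)
        if step in explicit or (start_step <= step < stop and step % save_interval == 0)
    ]


# The losses configuration above: every 100 training steps, every 10 evaluation steps.
train_saved = saved_steps(total_steps=450, save_interval=100)
eval_saved = saved_steps(total_steps=35, save_interval=10)
print(train_saved)  # [0, 100, 200, 300, 400]
print(eval_saved)   # [0, 10, 20, 30]
```

Under this model, raising the interval reduces the number of saved steps roughly in proportion, which is the main lever for controlling how much tensor data a job writes.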
- class sagemaker.debugger.DebuggerHookConfig(s3_output_path=None, container_local_output_path=None, hook_parameters=None, collection_configs=None)
Bases: object
Create a Debugger hook configuration object to save tensors for debugging.
DebuggerHookConfig provides options to customize how debugging information is emitted and saved. This high-level DebuggerHookConfig class is based on the smdebug.SaveConfig class.
Initialize the DebuggerHookConfig instance.
- Parameters
s3_output_path (str or PipelineVariable) – Optional. The location in Amazon S3 to store the output tensors. The default Debugger output path is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/debug-output/.
container_local_output_path (str or PipelineVariable) – Optional. The local path in the container.
hook_parameters (dict[str, str] or dict[str, PipelineVariable]) – Optional. A dictionary of parameters.
collection_configs ([sagemaker.debugger.CollectionConfig]) – Required. A list of CollectionConfig objects to be saved at the s3_output_path.
Example of creating a DebuggerHookConfig object:
    from sagemaker.debugger import CollectionConfig, DebuggerHookConfig

    collection_configs = [
        CollectionConfig(name="tensor_collection_1"),
        CollectionConfig(name="tensor_collection_2"),
        # ...
        CollectionConfig(name="tensor_collection_n")
    ]

    hook_config = DebuggerHookConfig(
        collection_configs=collection_configs
    )
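When s3_output_path is omitted, the output location follows the pattern given in the parameter description. The sketch below only formats that documented pattern so you can predict where to look for the tensors. The helper name default_debug_output_path is hypothetical, and the Region, account ID, and job name are placeholders.

```python
def default_debug_output_path(region, account_id, training_job_name):
    """Format the documented default Debugger output path.

    Hypothetical helper that mirrors the pattern
    s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/debug-output/
    """
    if len(account_id) != 12 or not account_id.isdigit():
        raise ValueError("account_id must be a 12-digit string")
    return f"s3://sagemaker-{region}-{account_id}/{training_job_name}/debug-output/"


# Placeholder values for illustration only.
path = default_debug_output_path("us-west-2", "123456789012", "my-training-job")
print(path)  # s3://sagemaker-us-west-2-123456789012/my-training-job/debug-output/
```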
- class sagemaker.debugger.TensorBoardOutputConfig(s3_output_path, container_local_output_path=None)
Bases: object
Create a tensor output configuration object for debugging visualizations on TensorBoard.
Initialize the TensorBoardOutputConfig instance.
- Parameters
s3_output_path (str or PipelineVariable) – Required. The location in Amazon S3 to store the output.
container_local_output_path (str or PipelineVariable) – Optional. The local path in the container.