Profiler
Amazon SageMaker Profiler provides full visibility into provisioned compute resources for training state-of-the-art deep learning models. The following SageMaker Profiler classes are for activating SageMaker Profiler while creating an estimator object of sagemaker.pytorch.estimator.PyTorch or sagemaker.tensorflow.estimator.TensorFlow.
Profiler configuration modules
- class sagemaker.Profiler(cpu_profiling_duration=3600)
A configuration class to activate Amazon SageMaker Profiler.
To adjust the Profiler configuration instead of using the default configuration, use the following parameters.
Parameters:
cpu_profiling_duration (str): Specify the time duration in seconds for profiling CPU activities. The default value is 3600 seconds.
Example usage:
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker import ProfilerConfig, Profiler

# ProfilerConfig takes the Profiler object through its profile_params parameter
profiler_config = ProfilerConfig(
    profile_params=Profiler(cpu_profiling_duration=3600)
)

estimator = PyTorch(
    framework_version="2.0.0",
    ... # Set up other essential parameters for the estimator class
    profiler_config=profiler_config
)
For complete instructions on activating and using SageMaker Profiler, see Use Amazon SageMaker Profiler to profile activities on AWS compute resources.
- class sagemaker.ProfilerConfig(s3_output_path=None, system_monitor_interval_millis=None, framework_profile_params=None, profile_params=None, disable_profiler=False)
Configuration for collecting system and framework metrics of SageMaker training jobs.
SageMaker Debugger collects system and framework profiling information of training jobs and identifies performance bottlenecks.
Initialize a ProfilerConfig instance. Pass the output of this class to the profiler_config parameter of the generic Estimator class and SageMaker Framework estimators.
- Parameters
s3_output_path (str or PipelineVariable) – The location in Amazon S3 to store the output. The default Debugger output path for profiling data is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/.
system_monitor_interval_millis (int or PipelineVariable) – The time interval in milliseconds to collect system metrics. Available values are 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds. The default is 500 milliseconds.
framework_profile_params (FrameworkProfile) – (Deprecated) A parameter object for framework metrics profiling. Configure it using the FrameworkProfile class. To use the default framework profile parameters, pass FrameworkProfile(). For more information about the default values, see FrameworkProfile.
disable_profiler (bool) – Switch the basic monitoring on or off using this parameter. The default is False.
profile_params (dict or an object of sagemaker.Profiler) – Pass this parameter to activate SageMaker Profiler using the sagemaker.Profiler class.
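The allowed values for system_monitor_interval_millis can be checked before a job is submitted. The following is a minimal sketch, not part of the SDK: profiler_config_kwargs and make_profiler_config are hypothetical helper names, and the set of allowed intervals is copied from the parameter description above. The SDK import is deferred so that the check itself runs without the SageMaker Python SDK installed.

```python
# Allowed sampling intervals in milliseconds, as listed in the parameter description.
ALLOWED_INTERVALS_MS = (100, 200, 500, 1000, 5000, 60000)
DEFAULT_INTERVAL_MS = 500


def profiler_config_kwargs(interval_ms=None):
    """Hypothetical helper: build keyword arguments for ProfilerConfig."""
    if interval_ms is None:
        interval_ms = DEFAULT_INTERVAL_MS
    if interval_ms not in ALLOWED_INTERVALS_MS:
        raise ValueError(
            f"system_monitor_interval_millis must be one of "
            f"{ALLOWED_INTERVALS_MS}, got {interval_ms}"
        )
    return {"system_monitor_interval_millis": interval_ms}


def make_profiler_config(interval_ms=None):
    """Create a ProfilerConfig from a validated interval."""
    # Deferred import: requires the SageMaker Python SDK to be installed.
    from sagemaker.debugger import ProfilerConfig

    return ProfilerConfig(**profiler_config_kwargs(interval_ms))
```

Calling profiler_config_kwargs() with no argument falls back to the documented default of 500 milliseconds, and an unlisted value such as 250 raises ValueError before any training job is created.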
Basic profiling using SageMaker Debugger
By default, if you submit training jobs using SageMaker Python SDK’s estimator classes, SageMaker runs basic profiling automatically. The following example shows the basic profiling configuration that you can utilize to update the time interval for collecting system resource utilization.
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import ProfilerConfig

# Collect system metrics every 500 milliseconds
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500
)

estimator = PyTorch(
    framework_version="2.0.0",
    ... # Set up other essential parameters for the estimator class
    profiler_config=profiler_config
)
For complete instructions on activating and using SageMaker Debugger, see Monitor AWS compute resource utilization in Amazon SageMaker Studio.
Deep profiling using SageMaker Profiler
The following example shows a configuration for activating SageMaker Profiler.
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker import ProfilerConfig, Profiler

# ProfilerConfig takes the Profiler object through its profile_params parameter
profiler_config = ProfilerConfig(
    profile_params=Profiler(cpu_profiling_duration=3600)
)

estimator = PyTorch(
    framework_version="2.0.0",
    ... # Set up other essential parameters for the estimator class
    profiler_config=profiler_config
)
For complete instructions on activating and using SageMaker Profiler, see Use Amazon SageMaker Profiler to profile activities on AWS compute resources.
Profiler Rule APIs
The following API is for setting up SageMaker Debugger’s profiler rules to detect computational performance issues from training jobs.
- class sagemaker.debugger.ProfilerRule(name, image_uri, instance_type, container_local_output_path, s3_output_path, volume_size_in_gb, rule_parameters)
The SageMaker Debugger ProfilerRule class configures profiling rules.
SageMaker Debugger profiling rules automatically analyze hardware system resource utilization and framework metrics of a training job to identify performance bottlenecks.
SageMaker Debugger comes pre-packaged with built-in profiling rules. For example, the profiling rules can detect if GPUs are underutilized due to CPU bottlenecks or IO bottlenecks. For a full list of built-in rules for debugging, see List of Debugger Built-in Rules. You can also write your own profiling rules using the Amazon SageMaker Debugger APIs.
Tip
Use the following ProfilerRule.sagemaker class method for built-in profiling rules or the ProfilerRule.custom class method for custom profiling rules. Do not directly use the Rule initialization method.
Method generated by attrs for class RuleBase.
- classmethod sagemaker(base_config, name=None, container_local_output_path=None, s3_output_path=None, instance_type=None, volume_size_in_gb=None)
Initialize a ProfilerRule object for a built-in profiling rule.
The rule analyzes system and framework metrics of a given training job to identify performance bottlenecks.
- Parameters
base_config (rule_configs.ProfilerRule) – The base rule configuration object returned from the rule_configs method. For example, ‘rule_configs.ProfilerReport()’. For a full list of built-in rules for debugging, see List of Debugger Built-in Rules.
name (str) – The name of the profiler rule. If one is not provided, the name of the base_config will be used.
container_local_output_path (str) – The path in the container.
s3_output_path (str) – The location in Amazon S3 to store the profiling output data. The default Debugger output path for profiling data is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/.
- Returns
The instance of the built-in ProfilerRule.
- Return type
sagemaker.debugger.ProfilerRule
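The default output location described for s3_output_path follows a fixed pattern, which can be sketched as below. This is an illustration only: default_profiler_output_path and make_profiler_report_rule are hypothetical helper names, and the region, account ID, and job name in the usage note are placeholders. The second function shows how the built-in ProfilerReport rule mentioned above could be created; its import is deferred because it requires the SageMaker Python SDK.

```python
def default_profiler_output_path(region, account_id, training_job_name):
    """Hypothetical helper: format the documented default profiler output path."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    return (
        f"s3://sagemaker-{region}-{account_id}/"
        f"{training_job_name}/profiler-output/"
    )


def make_profiler_report_rule(s3_output_path=None):
    """Create the built-in ProfilerReport rule, optionally with a custom output path."""
    # Deferred import: requires the SageMaker Python SDK to be installed.
    from sagemaker.debugger import ProfilerRule, rule_configs

    return ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(),
        s3_output_path=s3_output_path,
    )
```

For example, default_profiler_output_path("us-west-2", "111122223333", "my-job") yields s3://sagemaker-us-west-2-111122223333/my-job/profiler-output/.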
- classmethod custom(name, image_uri, instance_type, volume_size_in_gb, source=None, rule_to_invoke=None, container_local_output_path=None, s3_output_path=None, rule_parameters=None)
Initialize a ProfilerRule object for a custom profiling rule.
You can create a rule that analyzes system and framework metrics emitted during the training of a model and monitors conditions that are critical for the success of a training job.
- Parameters
name (str) – The name of the profiler rule.
image_uri (str) – The URI of the image to be used by the profiler rule.
instance_type (str) – Type of EC2 instance to use, for example, ‘ml.c4.xlarge’.
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data.
source (str) – A source file containing a rule to invoke. If provided, you must also provide rule_to_invoke. This can be either an S3 URI or a local path.
rule_to_invoke (str) – The name of the rule to invoke within the source. If provided, you must also provide the source.
container_local_output_path (str) – The path in the container.
s3_output_path (str) – The location in Amazon S3 to store the output. The default Debugger output path for profiling data is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/.
rule_parameters (dict) – A dictionary of parameters for the rule.
- Returns
The instance of the custom ProfilerRule.
- Return type
sagemaker.debugger.ProfilerRule
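The source and rule_to_invoke parameters must be given together, which is easy to check before calling ProfilerRule.custom. The sketch below assumes hypothetical helper names (custom_rule_kwargs, make_custom_rule) and is not SDK code; the keyword names match the classmethod signature shown above, and the SDK import is deferred.

```python
def custom_rule_kwargs(name, image_uri, instance_type, volume_size_in_gb,
                       source=None, rule_to_invoke=None):
    """Hypothetical helper: validate and collect arguments for ProfilerRule.custom."""
    # source and rule_to_invoke are documented as requiring each other.
    if (source is None) != (rule_to_invoke is None):
        raise ValueError("source and rule_to_invoke must be provided together")
    kwargs = {
        "name": name,
        "image_uri": image_uri,
        "instance_type": instance_type,
        "volume_size_in_gb": volume_size_in_gb,
    }
    if source is not None:
        kwargs["source"] = source
        kwargs["rule_to_invoke"] = rule_to_invoke
    return kwargs


def make_custom_rule(**kwargs):
    """Create a custom ProfilerRule from validated arguments."""
    # Deferred import: requires the SageMaker Python SDK to be installed.
    from sagemaker.debugger import ProfilerRule

    return ProfilerRule.custom(**custom_rule_kwargs(**kwargs))
```

Passing only one of source or rule_to_invoke fails in the helper, before any SDK call is made.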
Debugger Configuration APIs for Framework Profiling (Deprecated)
Warning
In favor of Amazon SageMaker Profiler, SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and SDKs as follows.
SageMaker Python SDK <= v2.130.0
PyTorch >= v1.6.0, < v2.0
TensorFlow >= v2.3.1, < v2.11
With the deprecation, SageMaker Debugger discontinues support for the APIs below this note.
See also Amazon SageMaker Debugger Release Notes: March 16, 2023.
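The version ranges in this warning can be expressed as a small compatibility check. This is a sketch under stated assumptions: framework_profiling_supported is a hypothetical helper, the framework labels "pytorch" and "tensorflow" are chosen for illustration, and version strings are assumed to be plain dotted numbers (no suffixes such as rc1).

```python
def _version_tuple(version):
    """Parse a dotted version such as '2.10' into a 3-part tuple of ints."""
    parts = tuple(int(part) for part in version.split("."))
    return parts + (0,) * (3 - len(parts))


def framework_profiling_supported(sdk_version, framework, framework_version):
    """Hypothetical check against the version ranges listed in the warning."""
    if _version_tuple(sdk_version) > _version_tuple("2.130.0"):
        return False
    fw = _version_tuple(framework_version)
    if framework == "pytorch":
        return _version_tuple("1.6.0") <= fw < _version_tuple("2.0")
    if framework == "tensorflow":
        return _version_tuple("2.3.1") <= fw < _version_tuple("2.11")
    raise ValueError(f"unknown framework: {framework}")
```

With these ranges, PyTorch 1.13.1 on SDK 2.130.0 is reported as supported, while PyTorch 2.0.0 or any SDK version above 2.130.0 is not.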
- class sagemaker.debugger.FrameworkProfile(local_path='/opt/ml/output/profiler', file_max_size=10485760, file_close_interval=60, file_open_fail_threshold=50, detailed_profiling_config=None, dataloader_profiling_config=None, python_profiling_config=None, horovod_profiling_config=None, smdataparallel_profiling_config=None, start_step=None, num_steps=None, start_unix_time=None, duration=None)
Bases: object
Sets up the profiling configuration for framework metrics.
Validates user inputs and fills in default values if no input is provided. There are three main profiling options to choose from: DetailedProfilingConfig, DataloaderProfilingConfig, and PythonProfilingConfig.
The following list shows available scenarios of configuring the profiling options.
1. None of the profiling configuration, step range, or time range is specified. SageMaker Debugger activates framework profiling based on the default settings of each profiling option.
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    framework_profile_params=FrameworkProfile()
)
2. Target step or time range is specified to this FrameworkProfile class. The requested target step or time range setting propagates to all of the framework profiling options. For example, if you configure this class as follows, all of the profiling options profile the 6th step:
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    framework_profile_params=FrameworkProfile(start_step=6, num_steps=1)
)
3. Individual profiling configurations are specified through the *_profiling_config parameters. SageMaker Debugger profiles framework metrics only for the specified profiling configurations. For example, if the DetailedProfilingConfig class is configured but not the other profiling options, Debugger only profiles based on the settings specified to the DetailedProfilingConfig class. The following example shows a profiling configuration to perform detailed profiling at step 10, data loader profiling at steps 9 and 10, and Python profiling at step 12.
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.debugger import DetailedProfilingConfig, DataloaderProfilingConfig, PythonProfilingConfig

profiler_config = ProfilerConfig(
    framework_profile_params=FrameworkProfile(
        detailed_profiling_config=DetailedProfilingConfig(start_step=10, num_steps=1),
        dataloader_profiling_config=DataloaderProfilingConfig(start_step=9, num_steps=2),
        python_profiling_config=PythonProfilingConfig(start_step=12, num_steps=1),
    )
)
If the individual profiling configurations are specified in addition to the step or time range, SageMaker Debugger prioritizes the individual profiling configurations and ignores the step or time range. For example, in the following code, start_step=1 and num_steps=10 are ignored.
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.debugger import DetailedProfilingConfig, DataloaderProfilingConfig, PythonProfilingConfig

profiler_config = ProfilerConfig(
    framework_profile_params=FrameworkProfile(
        start_step=1,  # ignored because individual configurations are given
        num_steps=10,  # ignored because individual configurations are given
        detailed_profiling_config=DetailedProfilingConfig(start_step=10, num_steps=1),
        dataloader_profiling_config=DataloaderProfilingConfig(start_step=9, num_steps=2),
        python_profiling_config=PythonProfilingConfig(start_step=12, num_steps=1)
    )
)
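The precedence between the three scenarios and the class-level step range can be summarized in a few lines of plain Python. The sketch below is an interpretation of the behavior described above, not the SDK's implementation: resolve_step_ranges and the option labels are hypothetical, the per-option defaults are taken from the class descriptions on this page, and filling a missing start step with 0 or a missing step count with 1 follows the StepRange description and is an assumption here.

```python
# Default (start_step, num_steps) per option, as stated in each class description.
DEFAULT_STEP_RANGES = {
    "detailed": (5, 1),
    "dataloader": (7, 1),
    "python": (9, 3),
}


def resolve_step_ranges(start_step=None, num_steps=None, **option_ranges):
    """Hypothetical helper: return {option: (start_step, num_steps)} to profile."""
    unknown = set(option_ranges) - set(DEFAULT_STEP_RANGES)
    if unknown:
        raise ValueError(f"unknown profiling options: {sorted(unknown)}")

    if option_ranges:
        # Scenario 3: only the explicitly configured options are profiled,
        # and any class-level start_step/num_steps is ignored.
        return dict(option_ranges)

    if start_step is not None or num_steps is not None:
        # Scenario 2: the class-level range propagates to every option.
        shared = (
            0 if start_step is None else start_step,
            1 if num_steps is None else num_steps,
        )
        return {option: shared for option in DEFAULT_STEP_RANGES}

    # Scenario 1: nothing specified, so each option uses its own default.
    return dict(DEFAULT_STEP_RANGES)
```

For instance, resolve_step_ranges(start_step=1, num_steps=10, detailed=(10, 1)) returns only the detailed entry, mirroring the rule that individual configurations take priority over the class-level range.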
Initialize the FrameworkProfile class object.
- Parameters
detailed_profiling_config (DetailedProfilingConfig) – The configuration for detailed profiling. Configure it using the DetailedProfilingConfig class. Pass DetailedProfilingConfig() to use the default configuration.
Warning
This detailed framework profiling feature discontinues support for TensorFlow v2.11 and later. To use the detailed profiling feature, use previous versions of TensorFlow between v2.3.1 and v2.10.0.
dataloader_profiling_config (DataloaderProfilingConfig) – The configuration for dataloader metrics profiling. Configure it using the DataloaderProfilingConfig class. Pass DataloaderProfilingConfig() to use the default configuration.
python_profiling_config (PythonProfilingConfig) – The configuration for stats collected by the Python profiler (cProfile or Pyinstrument). Configure it using the PythonProfilingConfig class. Pass PythonProfilingConfig() to use the default configuration.
start_step (int) – The step at which to start profiling.
num_steps (int) – The number of steps to profile.
start_unix_time (int) – The Unix time at which to start profiling.
duration (float) – The duration in seconds to profile.
Tip
Available profiling range parameter pairs are (start_step and num_steps) and (start_unix_time and duration). The two parameter pairs are mutually exclusive, and this class validates if one of the two pairs is used. If both pairs are specified, a conflict error occurs.
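The mutual exclusivity of the two range pairs can be illustrated with a short check. This is a sketch of the rule stated in the tip, not the SDK's own validation: select_range_kind is a hypothetical name, and the exception type raised by the SDK on a conflict is not specified here, so ValueError is an assumption.

```python
def select_range_kind(start_step=None, num_steps=None,
                      start_unix_time=None, duration=None):
    """Hypothetical helper: report which range pair is in use, or 'default'."""
    uses_steps = start_step is not None or num_steps is not None
    uses_time = start_unix_time is not None or duration is not None
    if uses_steps and uses_time:
        # The two pairs are mutually exclusive.
        raise ValueError(
            "specify either (start_step, num_steps) or "
            "(start_unix_time, duration), not both"
        )
    if uses_steps:
        return "steps"
    if uses_time:
        return "time"
    return "default"
```

Mixing the pairs, for example start_step=5 together with duration=10.0, raises an error in this sketch, while leaving all four parameters unset falls through to the default configuration.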
- class sagemaker.debugger.DetailedProfilingConfig(start_step=None, num_steps=None, start_unix_time=None, duration=None, profile_default_steps=False)
Bases: MetricsConfigBase
The configuration for framework metrics to be collected for detailed profiling.
Specify target steps or a target duration to profile.
By default, it profiles step 5 of the training job.
If profile_default_steps is set to True and none of the other range parameters is specified, the class uses the default configuration for detailed profiling.
- Parameters
start_step (int) – The step to start profiling. The default is step 5.
num_steps (int) – The number of steps to profile. The default is for 1 step.
start_unix_time (int) – The Unix time to start profiling.
duration (float) – The duration in seconds to profile.
profile_default_steps (bool) – Indicates whether the default config should be used.
Tip
Available profiling range parameter pairs are (start_step and num_steps) and (start_unix_time and duration). The two parameter pairs are mutually exclusive, and this class validates if one of the two pairs is used. If both pairs are specified, a conflict error occurs.
Warning
This detailed framework profiling feature discontinues support for TensorFlow v2.11 and later. To use the detailed profiling feature, use previous versions of TensorFlow between v2.3.1 and v2.10.0.
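As a rough illustration of how start_step and num_steps select training steps, the pair can be expanded into an explicit list. profiled_steps is a hypothetical helper, and its defaults of step 5 and 1 step mirror the defaults documented for detailed profiling; the contiguous-range interpretation is inferred from the examples on this page (for instance, start_step=9 with num_steps=2 covering steps 9 and 10).

```python
def profiled_steps(start_step=5, num_steps=1):
    """Hypothetical helper: list the training steps covered by a step range."""
    if start_step < 0:
        raise ValueError("start_step must be non-negative")
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    # A step range covers num_steps consecutive steps beginning at start_step.
    return list(range(start_step, start_step + num_steps))
```

With the defaults, only step 5 is covered; profiled_steps(9, 3) covers steps 9, 10, and 11, which matches the default Python profiling range described for PythonProfilingConfig.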
- class sagemaker.debugger.DataloaderProfilingConfig(start_step=None, num_steps=None, start_unix_time=None, duration=None, profile_default_steps=False, metrics_regex='.*')
Bases: MetricsConfigBase
The configuration for framework metrics to be collected for data loader profiling.
Specify target steps or a target duration to profile.
By default, it profiles step 7 of training. If profile_default_steps is set to True and none of the other range parameters is specified, the class uses the default config for dataloader profiling.
- Parameters
start_step (int) – The step to start profiling. The default is step 7.
num_steps (int) – The number of steps to profile. The default is for 1 step.
start_unix_time (int) – The Unix time to start profiling.
duration (float) – The duration in seconds to profile.
profile_default_steps (bool) – Indicates whether the default config should be used.
- class sagemaker.debugger.PythonProfilingConfig(start_step=None, num_steps=None, start_unix_time=None, duration=None, profile_default_steps=False, python_profiler=PythonProfiler.CPROFILE, cprofile_timer=cProfileTimer.TOTAL_TIME)
Bases: MetricsConfigBase
The configuration for framework metrics to be collected for Python profiling.
Choose a Python profiler: cProfile or Pyinstrument.
Specify target steps or a target duration to profile. If no parameter is specified, it profiles based on profiling configurations preset by the profile_default_steps parameter, which is set to True by default. If you specify the following parameters, then the profile_default_steps parameter will be ignored.
- Parameters
start_step (int) – The step to start profiling. The default is step 9.
num_steps (int) – The number of steps to profile. The default is for 3 steps.
start_unix_time (int) – The Unix time to start profiling.
duration (float) – The duration in seconds to profile.
profile_default_steps (bool) – Indicates whether the default configuration should be used. If set to True, Python profiling will be done at steps 9, 10, and 11 of training, using cProfile and collecting metrics based on the total time, CPU time, and off-CPU time for these three steps, respectively. The default is True.
python_profiler (PythonProfiler) – The Python profiler to use to collect Python profiling stats. Available options are "cProfile" and "Pyinstrument". The default is "cProfile". Instead of passing the string values, you can also use the enumerator util, PythonProfiler, to choose one of the available options.
cprofile_timer (cProfileTimer) – The timer to be used by cProfile when collecting Python profiling stats. Available options are "total_time", "cpu_time", and "off_cpu_time". The default is "total_time". If you choose Pyinstrument, this parameter is ignored. Instead of passing the string values, you can also use the enumerator util, cProfileTimer, to choose one of the available options.
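The interplay between python_profiler and cprofile_timer (strings or enum members are both accepted, and the timer is ignored for Pyinstrument) can be sketched with standard-library enums. ProfilerChoice, TimerChoice, and normalize_python_profiling are local stand-ins invented for this illustration; they are not the SDK's PythonProfiler or cProfileTimer classes, and their member values simply reuse the option strings listed above.

```python
from enum import Enum


class ProfilerChoice(Enum):
    """Local stand-in for the profiler options (not the SDK enum)."""
    CPROFILE = "cProfile"
    PYINSTRUMENT = "Pyinstrument"


class TimerChoice(Enum):
    """Local stand-in for the cProfile timer options (not the SDK enum)."""
    TOTAL_TIME = "total_time"
    CPU_TIME = "cpu_time"
    OFF_CPU_TIME = "off_cpu_time"


def normalize_python_profiling(python_profiler="cProfile",
                               cprofile_timer="total_time"):
    """Hypothetical helper: accept strings or enum members, return enum members."""
    if isinstance(python_profiler, str):
        python_profiler = ProfilerChoice(python_profiler)
    if python_profiler is ProfilerChoice.PYINSTRUMENT:
        # The timer only applies to cProfile, so it is dropped here.
        return python_profiler, None
    if isinstance(cprofile_timer, str):
        cprofile_timer = TimerChoice(cprofile_timer)
    return python_profiler, cprofile_timer
```

An unrecognized string such as "perf" raises ValueError from the enum lookup, which is a convenient way to catch typos early in this sketch.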
- class sagemaker.debugger.PythonProfiler(value)
Bases: Enum
Enum to list the Python profiler options for Python profiling.
- CPROFILE
Use to choose "cProfile".
- PYINSTRUMENT
Use to choose "Pyinstrument".
- class sagemaker.debugger.cProfileTimer(value)
Bases: Enum
Enum to list the possible cProfile timers for Python profiling.
- TOTAL_TIME
Use to choose "total_time".
- CPU_TIME
Use to choose "cpu_time".
- OFF_CPU_TIME
Use to choose "off_cpu_time".
The various types of metrics configurations that can be specified in FrameworkProfile.
- class sagemaker.debugger.metrics_config.StepRange(start_step, num_steps)
Configuration for the range of steps to profile.
It returns the target steps in dictionary format that you can pass to the FrameworkProfile class.
Set the start step and the number of steps.
If the start step is not specified, Debugger starts profiling at step 0. If the number of steps is not specified, Debugger profiles for 1 step.
- Parameters
start_step (int) – The step to start profiling.
num_steps (int) – The number of steps to profile.
- class sagemaker.debugger.metrics_config.TimeRange(start_unix_time, duration)
Configuration for the range of Unix time to profile.
It returns the target time duration in dictionary format that you can pass to the FrameworkProfile class.
Set the start Unix time and the duration.
If the start Unix time is not specified, Debugger starts profiling at step 0. If the duration is not specified, Debugger profiles for 1 step.
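- Parameters
start_unix_time (int) – The Unix time to start profiling.
duration (float) – The duration in seconds to profile.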