Profiler

Amazon SageMaker Profiler provides full visibility into provisioned compute resources for training state-of-the-art deep learning models. Use the following SageMaker Profiler classes to activate SageMaker Profiler when you create an estimator object of sagemaker.pytorch.estimator.PyTorch or sagemaker.tensorflow.estimator.TensorFlow.

Profiler configuration modules

class sagemaker.Profiler(cpu_profiling_duration=3600)

A configuration class to activate Amazon SageMaker Profiler.

To adjust the Profiler configuration instead of using the default configuration, use the following parameters.

Parameters:

  • cpu_profiling_duration (str): Specify the time duration in seconds for profiling CPU activities. The default value is 3600 seconds.

Example usage:

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker import ProfilerConfig, Profiler

profiler_config = ProfilerConfig(
    profile_params = Profiler(cpu_profiling_duration=3600)
)

estimator = PyTorch(
    framework_version="2.0.0",
    ... # Set up other essential parameters for the estimator class
    profiler_config=profiler_config
)

For complete instructions on activating and using SageMaker Profiler, see Use Amazon SageMaker Profiler to profile activities on AWS compute resources.

class sagemaker.ProfilerConfig(s3_output_path=None, system_monitor_interval_millis=None, framework_profile_params=None, profile_params=None, disable_profiler=False)

Configuration for collecting system and framework metrics of SageMaker training jobs.

SageMaker Debugger collects system and framework profiling information of training jobs and identifies performance bottlenecks.

Initialize a ProfilerConfig instance.

Pass the output of this class to the profiler_config parameter of the generic Estimator class and SageMaker Framework estimators.

Parameters
  • s3_output_path (str or PipelineVariable) – The location in Amazon S3 to store the output. The default Debugger output path for profiling data is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/.

  • system_monitor_interval_millis (int or PipelineVariable) – The time interval in milliseconds to collect system metrics. Available values are 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds. The default is 500 milliseconds.

  • framework_profile_params (FrameworkProfile) – (Deprecated) A parameter object for framework metrics profiling. Configure it using the FrameworkProfile class. To use the default framework profile parameters, pass FrameworkProfile(). For more information about the default values, see FrameworkProfile.

  • disable_profiler (bool) – Switch the basic monitoring on or off using this parameter. The default is False.

  • profile_params (dict or an object of sagemaker.Profiler) – Pass this parameter to activate SageMaker Profiler using the sagemaker.Profiler class.
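
The allowed values for system_monitor_interval_millis listed above can be checked before a configuration is built. The following is a minimal sketch in plain Python. The helper name validate_monitor_interval and the choice of ValueError are assumptions made for illustration; they are not part of the SageMaker Python SDK.

```python
# Hypothetical helper, not part of the SageMaker Python SDK.
# The allowed values and the default come from the parameter description above.
ALLOWED_INTERVALS_MILLIS = (100, 200, 500, 1000, 5000, 60000)
DEFAULT_INTERVAL_MILLIS = 500


def validate_monitor_interval(interval_millis=None):
    """Return a valid system monitoring interval in milliseconds."""
    if interval_millis is None:
        # Fall back to the documented default of 500 milliseconds.
        return DEFAULT_INTERVAL_MILLIS
    if interval_millis not in ALLOWED_INTERVALS_MILLIS:
        raise ValueError(
            f"interval must be one of {ALLOWED_INTERVALS_MILLIS}, got {interval_millis}"
        )
    return interval_millis
```

A value that passes this check could then be passed as system_monitor_interval_millis when creating a ProfilerConfig object.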

Basic profiling using SageMaker Debugger

By default, if you submit training jobs using the SageMaker Python SDK’s estimator classes, SageMaker runs basic profiling automatically. The following example shows a basic profiling configuration that updates the time interval for collecting system resource utilization metrics.

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import ProfilerConfig

profiler_config = ProfilerConfig(
    system_monitor_interval_millis = 500
)

estimator = PyTorch(
    framework_version="2.0.0",
    ... # Set up other essential parameters for the estimator class
    profiler_config=profiler_config
)

For complete instructions on activating and using SageMaker Debugger, see Monitor AWS compute resource utilization in Amazon SageMaker Studio.

Deep profiling using SageMaker Profiler

The following example shows a configuration for activating SageMaker Profiler.

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker import ProfilerConfig, Profiler

profiler_config = ProfilerConfig(
    profile_params = Profiler(cpu_profiling_duration=3600)
)

estimator = PyTorch(
    framework_version="2.0.0",
    ... # Set up other essential parameters for the estimator class
    profiler_config=profiler_config
)

For complete instructions on activating and using SageMaker Profiler, see Use Amazon SageMaker Profiler to profile activities on AWS compute resources.

Profiler Rule APIs

The following API is for setting up SageMaker Debugger’s profiler rules to detect computational performance issues from training jobs.

class sagemaker.debugger.ProfilerRule(name, image_uri, instance_type, container_local_output_path, s3_output_path, volume_size_in_gb, rule_parameters)

The SageMaker Debugger ProfilerRule class configures profiling rules.

SageMaker Debugger profiling rules automatically analyze hardware system resource utilization and framework metrics of a training job to identify performance bottlenecks.

SageMaker Debugger comes pre-packaged with built-in profiling rules. For example, the profiling rules can detect if GPUs are underutilized due to CPU bottlenecks or IO bottlenecks. For a full list of built-in rules for debugging, see List of Debugger Built-in Rules. You can also write your own profiling rules using the Amazon SageMaker Debugger APIs.

Tip

Use the following ProfilerRule.sagemaker class method for built-in profiling rules or the ProfilerRule.custom class method for custom profiling rules. Do not directly use the Rule initialization method.


classmethod sagemaker(base_config, name=None, container_local_output_path=None, s3_output_path=None, instance_type=None, volume_size_in_gb=None)

Initialize a ProfilerRule object for a built-in profiling rule.

The rule analyzes system and framework metrics of a given training job to identify performance bottlenecks.

Parameters
  • base_config (rule_configs.ProfilerRule) –

    The base rule configuration object returned from the rule_configs method. For example, ‘rule_configs.ProfilerReport()’. For a full list of built-in rules for debugging, see List of Debugger Built-in Rules.

  • name (str) – The name of the profiler rule. If one is not provided, the name of the base_config will be used.

  • container_local_output_path (str) – The path in the container.

  • s3_output_path (str) – The location in Amazon S3 to store the profiling output data. The default Debugger output path for profiling data is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/.

Returns

The instance of the built-in ProfilerRule.

Return type

ProfilerRule
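
The default s3_output_path pattern described in the parameters above can be written out as a small formatting function. This is only a sketch of the documented pattern; the function name default_profiler_output_path and the 12-digit check are assumptions for illustration, and the SDK resolves the actual default path itself.

```python
# Hypothetical helper, not part of the SageMaker Python SDK.
def default_profiler_output_path(region, account_id, training_job_name):
    """Build a path following the documented default pattern:
    s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/
    """
    if len(account_id) != 12 or not account_id.isdigit():
        raise ValueError("account_id must be a 12-digit string")
    return (
        f"s3://sagemaker-{region}-{account_id}/"
        f"{training_job_name}/profiler-output/"
    )
```

For example, calling the function with "us-west-2", "111122223333", and "my-job" yields a path under the sagemaker-us-west-2-111122223333 bucket.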

classmethod custom(name, image_uri, instance_type, volume_size_in_gb, source=None, rule_to_invoke=None, container_local_output_path=None, s3_output_path=None, rule_parameters=None)

Initialize a ProfilerRule object for a custom profiling rule.

You can create a rule that analyzes system and framework metrics emitted during the training of a model and monitors conditions that are critical for the success of a training job.

Parameters
  • name (str) – The name of the profiler rule.

  • image_uri (str) – The URI of the image to be used by the profiler rule.

  • instance_type (str) – Type of EC2 instance to use, for example, ‘ml.c4.xlarge’.

  • volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data.

  • source (str) – A source file containing a rule to invoke. If provided, you must also provide rule_to_invoke. This can be either an S3 URI or a local path.

  • rule_to_invoke (str) – The name of the rule to invoke within the source. If provided, you must also provide the source.

  • container_local_output_path (str) – The path in the container.

  • s3_output_path (str) – The location in Amazon S3 to store the output. The default Debugger output path for profiling data is created under the default output path of the Estimator class. For example, s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/.

  • rule_parameters (dict) – A dictionary of parameters for the rule.

Returns

The instance of the custom ProfilerRule.

Return type

ProfilerRule
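
The parameter descriptions above state that source and rule_to_invoke must be provided together. A minimal sketch of that constraint in plain Python follows. The function name check_custom_rule_source and the use of ValueError are assumptions for illustration; how the SDK itself reports the problem may differ.

```python
# Hypothetical helper, not part of the SageMaker Python SDK.
def check_custom_rule_source(source=None, rule_to_invoke=None):
    """Return True if both values are given and False if neither is.

    Raise ValueError if only one of the two is given, because the
    documentation requires them to be provided together.
    """
    if (source is None) != (rule_to_invoke is None):
        raise ValueError("source and rule_to_invoke must be provided together")
    return source is not None
```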

to_profiler_rule_config_dict()

Generates a request dictionary using the parameters provided when initializing the object.

Returns

A portion of an API request as a dictionary.

Return type

dict

Debugger Configuration APIs for Framework Profiling (Deprecated)

Warning

In favor of Amazon SageMaker Profiler, SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and SDKs as follows.

  • SageMaker Python SDK <= v2.130.0

  • PyTorch >= v1.6.0, < v2.0

  • TensorFlow >= v2.3.1, < v2.11

With the deprecation, SageMaker Debugger discontinues support for the APIs below this note.

See also Amazon SageMaker Debugger Release Notes: March 16, 2023.

class sagemaker.debugger.FrameworkProfile(local_path='/opt/ml/output/profiler', file_max_size=10485760, file_close_interval=60, file_open_fail_threshold=50, detailed_profiling_config=None, dataloader_profiling_config=None, python_profiling_config=None, horovod_profiling_config=None, smdataparallel_profiling_config=None, start_step=None, num_steps=None, start_unix_time=None, duration=None)

Bases: object

Sets up the profiling configuration for framework metrics.

Validates user inputs and fills in default values if no input is provided. There are three main profiling options to choose from: DetailedProfilingConfig, DataloaderProfilingConfig, and PythonProfilingConfig.

The following list shows available scenarios of configuring the profiling options.

1. None of the profiling configuration, step range, or time range is specified. SageMaker Debugger activates framework profiling based on the default settings of each profiling option.

from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config=ProfilerConfig(
    framework_profile_params=FrameworkProfile()
)

2. Target step or time range is specified to this FrameworkProfile class. The requested target step or time range setting propagates to all of the framework profiling options. For example, if you configure this class as follows, all of the profiling options profile step 6:

from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config=ProfilerConfig(
    framework_profile_params=FrameworkProfile(start_step=6, num_steps=1)
)

3. Individual profiling configurations are specified through the *_profiling_config parameters. SageMaker Debugger profiles framework metrics only for the specified profiling configurations. For example, if the DetailedProfilingConfig class is configured but not the other profiling options, Debugger only profiles based on the settings specified to the DetailedProfilingConfig class. The following example shows a profiling configuration that performs detailed profiling at step 10, data loader profiling at steps 9 and 10, and Python profiling at step 12.

from sagemaker.debugger import (
    ProfilerConfig,
    FrameworkProfile,
    DetailedProfilingConfig,
    DataloaderProfilingConfig,
    PythonProfilingConfig,
)

profiler_config=ProfilerConfig(
    framework_profile_params=FrameworkProfile(
        detailed_profiling_config=DetailedProfilingConfig(start_step=10, num_steps=1),
        dataloader_profiling_config=DataloaderProfilingConfig(start_step=9, num_steps=2),
        python_profiling_config=PythonProfilingConfig(start_step=12, num_steps=1),
    )
)

If the individual profiling configurations are specified in addition to the step or time range, SageMaker Debugger prioritizes the individual profiling configurations and ignores the step or time range. For example, in the following code, the start_step=1 and num_steps=10 will be ignored.

from sagemaker.debugger import (
    ProfilerConfig,
    FrameworkProfile,
    DetailedProfilingConfig,
    DataloaderProfilingConfig,
    PythonProfilingConfig,
)

profiler_config=ProfilerConfig(
    framework_profile_params=FrameworkProfile(
        start_step=1,
        num_steps=10,
        detailed_profiling_config=DetailedProfilingConfig(start_step=10, num_steps=1),
        dataloader_profiling_config=DataloaderProfilingConfig(start_step=9, num_steps=2),
        python_profiling_config=PythonProfilingConfig(start_step=12, num_steps=1)
    )
)
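
The precedence described in the scenarios above can be modeled in a few lines of plain Python. This is a simplified sketch, not the SDK's implementation: step ranges are represented as (start_step, num_steps) tuples, only the three main profiling options are modeled, and the function name effective_ranges is hypothetical. The default ranges are the ones documented for each option class (step 5 for detailed profiling, step 7 for data loader profiling, and steps 9 to 11 for Python profiling).

```python
# Simplified model of the documented precedence, not the SDK implementation.
# Each range is a (start_step, num_steps) tuple.
DEFAULT_RANGES = {
    "detailed": (5, 1),
    "dataloader": (7, 1),
    "python": (9, 3),
}


def effective_ranges(class_range=None, individual=None):
    """Return the step range each profiling option would use."""
    individual = individual or {}
    if individual:
        # Scenario 3: only the explicitly configured options are profiled,
        # and any class-level range is ignored.
        return dict(individual)
    if class_range is not None:
        # Scenario 2: the class-level range propagates to every option.
        return {name: class_range for name in DEFAULT_RANGES}
    # Scenario 1: every option falls back to its own default.
    return dict(DEFAULT_RANGES)
```

With class_range=(1, 10) and an individual detailed range of (10, 1), the sketch returns only the detailed range, which mirrors the example in which start_step=1 and num_steps=10 are ignored.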

Initialize the FrameworkProfile class object.

Parameters
  • detailed_profiling_config (DetailedProfilingConfig) –

    The configuration for detailed profiling. Configure it using the DetailedProfilingConfig class. Pass DetailedProfilingConfig() to use the default configuration.

    Warning

    This detailed framework profiling feature discontinues support for TensorFlow v2.11 and later. To use the detailed profiling feature, use previous versions of TensorFlow between v2.3.1 and v2.10.0.

  • dataloader_profiling_config (DataloaderProfilingConfig) – The configuration for dataloader metrics profiling. Configure it using the DataloaderProfilingConfig class. Pass DataloaderProfilingConfig() to use the default configuration.

  • python_profiling_config (PythonProfilingConfig) – The configuration for stats collected by the Python profiler (cProfile or Pyinstrument). Configure it using the PythonProfilingConfig class. Pass PythonProfilingConfig() to use the default configuration.

  • start_step (int) – The step at which to start profiling.

  • num_steps (int) – The number of steps to profile.

  • start_unix_time (int) – The Unix time at which to start profiling.

  • duration (float) – The duration in seconds to profile.

Tip

Available profiling range parameter pairs are (start_step and num_steps) and (start_unix_time and duration). The two parameter pairs are mutually exclusive, and this class validates if one of the two pairs is used. If both pairs are specified, a conflict error occurs.
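
The mutual exclusivity of the two range parameter pairs can be illustrated with a small check in plain Python. This is a sketch under assumptions: the function name validate_profiling_range and the use of ValueError are hypothetical, and the SDK's own validation and error type may differ.

```python
# Hypothetical helper, not part of the SageMaker Python SDK.
def validate_profiling_range(
    start_step=None, num_steps=None, start_unix_time=None, duration=None
):
    """Return "steps", "time", or None depending on which pair is used.

    Raise ValueError if parameters from both pairs are specified.
    """
    uses_steps = start_step is not None or num_steps is not None
    uses_time = start_unix_time is not None or duration is not None
    if uses_steps and uses_time:
        raise ValueError(
            "specify either (start_step, num_steps) or "
            "(start_unix_time, duration), not both"
        )
    if uses_steps:
        return "steps"
    if uses_time:
        return "time"
    return None
```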

class sagemaker.debugger.DetailedProfilingConfig(start_step=None, num_steps=None, start_unix_time=None, duration=None, profile_default_steps=False)

Bases: MetricsConfigBase

The configuration for framework metrics to be collected for detailed profiling.

Specify target steps or a target duration to profile.

By default, it profiles step 5 of the training job.

If profile_default_steps is set to True and none of the other range parameters is specified, the class uses the default configuration for detailed profiling.

Parameters
  • start_step (int) – The step to start profiling. The default is step 5.

  • num_steps (int) – The number of steps to profile. The default is for 1 step.

  • start_unix_time (int) – The Unix time to start profiling.

  • duration (float) – The duration in seconds to profile.

  • profile_default_steps (bool) – Indicates whether the default config should be used.

Tip

Available profiling range parameter pairs are (start_step and num_steps) and (start_unix_time and duration). The two parameter pairs are mutually exclusive, and this class validates if one of the two pairs is used. If both pairs are specified, a conflict error occurs.

Warning

This detailed framework profiling feature discontinues support for TensorFlow v2.11 and later. To use the detailed profiling feature, use previous versions of TensorFlow between v2.3.1 and v2.10.0.

class sagemaker.debugger.DataloaderProfilingConfig(start_step=None, num_steps=None, start_unix_time=None, duration=None, profile_default_steps=False, metrics_regex='.*')

Bases: MetricsConfigBase

The configuration for framework metrics to be collected for data loader profiling.

Specify target steps or a target duration to profile.

By default, it profiles step 7 of training. If profile_default_steps is set to True and none of the other range parameters is specified, the class uses the default config for dataloader profiling.

Parameters
  • start_step (int) – The step to start profiling. The default is step 7.

  • num_steps (int) – The number of steps to profile. The default is for 1 step.

  • start_unix_time (int) – The Unix time to start profiling.

  • duration (float) – The duration in seconds to profile.

  • profile_default_steps (bool) – Indicates whether the default config should be used.

class sagemaker.debugger.PythonProfilingConfig(start_step=None, num_steps=None, start_unix_time=None, duration=None, profile_default_steps=False, python_profiler=PythonProfiler.CPROFILE, cprofile_timer=cProfileTimer.TOTAL_TIME)

Bases: MetricsConfigBase

The configuration for framework metrics to be collected for Python profiling.

Choose a Python profiler: cProfile or Pyinstrument.

Specify target steps or a target duration to profile. If no parameter is specified, it profiles based on profiling configurations preset by the profile_default_steps parameter, which is set to True by default. If you specify the following parameters, then the profile_default_steps parameter will be ignored.

Parameters
  • start_step (int) – The step to start profiling. The default is step 9.

  • num_steps (int) – The number of steps to profile. The default is for 3 steps.

  • start_unix_time (int) – The Unix time to start profiling.

  • duration (float) – The duration in seconds to profile.

  • profile_default_steps (bool) – Indicates whether the default configuration should be used. If set to True, Python profiling will be done at steps 9, 10, and 11 of training, using cProfile and collecting metrics based on the total time, CPU time, and off-CPU time for these three steps respectively. The default is True.

  • python_profiler (PythonProfiler) – The Python profiler to use to collect Python profiling stats. Available options are "cProfile" and "Pyinstrument". The default is "cProfile". Instead of passing the string values, you can also use the enumerator util, PythonProfiler, to choose one of the available options.

  • cprofile_timer (cProfileTimer) – The timer to be used by cProfile when collecting Python profiling stats. Available options are "total_time", "cpu_time", and "off_cpu_time". The default is "total_time". If you choose Pyinstrument, this parameter is ignored. Instead of passing the string values, you can also use the enumerator util, cProfileTimer, to choose one of the available options.
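
The default Python profiling plan described for profile_default_steps pairs three consecutive steps with the three cProfile timers. The following plain-Python sketch spells out that pairing. The names used here are hypothetical and the function only restates the documented defaults; it does not call the SDK.

```python
# Restates the documented defaults; not part of the SageMaker Python SDK.
DEFAULT_START_STEP = 9
DEFAULT_TIMERS = ("total_time", "cpu_time", "off_cpu_time")


def default_python_profiling_plan():
    """Map each default step to the cProfile timer documented for it."""
    return {
        DEFAULT_START_STEP + offset: timer
        for offset, timer in enumerate(DEFAULT_TIMERS)
    }
```

The resulting dictionary maps step 9 to "total_time", step 10 to "cpu_time", and step 11 to "off_cpu_time".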

class sagemaker.debugger.PythonProfiler(value)

Bases: Enum

Enum to list the Python profiler options for Python profiling.

CPROFILE

Use to choose "cProfile".

PYINSTRUMENT

Use to choose "Pyinstrument".

class sagemaker.debugger.cProfileTimer(value)

Bases: Enum

Enum to list the possible cProfile timers for Python profiling.

TOTAL_TIME

Use to choose "total_time".

CPU_TIME

Use to choose "cpu_time".

OFF_CPU_TIME

Use to choose "off_cpu_time".
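
The parameter descriptions state that either a string value or an enumerator member can be passed. The following sketch shows one way such dual input could be normalized with the standard enum module. TimerOption is a stand-in defined here for illustration, with member values taken from the option strings listed above; it is not the SDK's cProfileTimer class, and normalize_timer is a hypothetical helper.

```python
from enum import Enum


class TimerOption(Enum):
    """Stand-in enum for illustration; not the SDK's cProfileTimer."""

    TOTAL_TIME = "total_time"
    CPU_TIME = "cpu_time"
    OFF_CPU_TIME = "off_cpu_time"


def normalize_timer(value):
    """Accept an enum member or its string value and return the member."""
    if isinstance(value, TimerOption):
        return value
    # Enum lookup by value raises ValueError for an unknown string.
    return TimerOption(value)
```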

The various types of metrics configurations that can be specified in FrameworkProfile.

class sagemaker.debugger.metrics_config.StepRange(start_step, num_steps)

Configuration for the range of steps to profile.

It returns the target steps in dictionary format that you can pass to the FrameworkProfile class.

Set the start step and num steps.

If the start step is not specified, Debugger starts profiling at step 0. If the number of steps is not specified, Debugger profiles for 1 step.

Parameters
  • start_step (int) – The step to start profiling.

  • num_steps (int) – The number of steps to profile.

to_json()

Convert the step range into a dictionary.

Returns

The step range as a dictionary.

Return type

dict
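
The defaulting behavior described for StepRange (start at step 0, profile for 1 step) and its conversion to a dictionary can be sketched as follows. StepRangeSketch is a hypothetical stand-in, and the dictionary keys are chosen for illustration; the keys that the SDK's to_json method actually produces are not shown in this reference and may differ.

```python
class StepRangeSketch:
    """Illustrative stand-in for a step range; not the SDK's StepRange."""

    def __init__(self, start_step=None, num_steps=None):
        # Documented fallbacks: start at step 0 and profile for 1 step.
        self.start_step = 0 if start_step is None else start_step
        self.num_steps = 1 if num_steps is None else num_steps

    def to_json(self):
        """Return the range as a dictionary (key names are assumptions)."""
        return {"start_step": self.start_step, "num_steps": self.num_steps}
```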

class sagemaker.debugger.metrics_config.TimeRange(start_unix_time, duration)

Configuration for the range of Unix time to profile.

It returns the target time duration in dictionary format that you can pass to the FrameworkProfile class.

Set the start Unix time and duration.

If the start Unix time is not specified, Debugger starts profiling at step 0. If the duration is not specified, Debugger profiles for 1 step.

Parameters
  • start_unix_time (int) – The Unix time to start profiling.

  • duration (float) – The duration in seconds to profile.

to_json()

Convert the time range into a dictionary.

Returns

The time range as a dictionary.

Return type

dict
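
A time-based range needs a start time expressed as Unix time (int) and a duration in seconds (float). The following sketch computes such a pair with the standard time module. The helper name time_window and its now parameter are hypothetical; the returned values are meant to be passed as start_unix_time and duration.

```python
import time


def time_window(delay_seconds, duration_seconds, now=None):
    """Return a start time and duration for a time-based profiling range.

    delay_seconds is how far in the future profiling should start.
    now can be supplied for reproducible results; otherwise the
    current Unix time is used.
    """
    current = time.time() if now is None else now
    return {
        "start_unix_time": int(current) + int(delay_seconds),
        "duration": float(duration_seconds),
    }
```

For example, time_window(600, 30) describes a 30-second window that starts 10 minutes from the moment of the call.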