Fine-Tuning with HyperPod

This notebook demonstrates fine-tuning Amazon Nova models using SFTTrainer on SageMaker HyperPod managed clusters with recipe overrides.

HyperPod provides managed cluster orchestration with support for multi-node distributed training.

What you will learn

Install the HyperPod CLI prerequisites
Create an SFTTrainer with HyperPodCompute
Use YAML recipe files with selective overrides
Use deep nested overrides for fine-grained control
Submit and monitor a training job

1. Prerequisites: HyperPod CLI Installation

HyperPod-based training requires the SageMaker HyperPod CLI to connect to clusters and start jobs.

Note: If you are a Nova Forge customer, download the HyperPod CLI with Forge feature support from S3 instead. See the Nova Forge SDK docs.
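Before continuing, it can help to confirm that a CLI executable is visible on your PATH. The sketch below is an illustration, not part of the SDK: the helper `find_cli` is hypothetical, and the executable names `hyperpod` and `hyp` are assumptions that may differ depending on the CLI version you installed.

```python
import shutil


def find_cli(candidates=("hyperpod", "hyp")):
    """Return the path of the first candidate executable found on PATH, else None.

    The candidate names are assumptions; adjust them to match your installation.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


cli_path = find_cli()
if cli_path is None:
    print("No HyperPod CLI executable found on PATH; install it before continuing.")
else:
    print(f"HyperPod CLI found at {cli_path}")
```

If the check fails even though the CLI is installed, the notebook kernel may be running in a different environment from the one where you installed it.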

2. Setup

import json
import os

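# Required for HyperPod CLI recipe resolution.
# Prepend the launcher scripts directory to PYTHONPATH, keeping any existing entries
# and avoiding a trailing separator when PYTHONPATH was previously unset.
_launcher_scripts = (
    "<path-to-your-hyperpod-cli>/hyperpod_cli/"
    "sagemaker_hyperpod_recipes/launcher/nemo/nemo_framework_launcher/launcher_scripts"
)
os.environ["PYTHONPATH"] = os.pathsep.join(
    p for p in (_launcher_scripts, os.environ.get("PYTHONPATH", "")) if p
)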

# === Fill in your AWS resources ===
REGION = "<your-region>"  # e.g. "us-east-1"
S3_BUCKET = "<your-s3-bucket>"  # e.g. "sagemaker-us-east-1-123456789012"

S3_OUTPUT_PATH = f"s3://{S3_BUCKET}/sft-hyperpod/output"
TRAINING_DATASET = f"s3://{S3_BUCKET}/datasets/sft_training_data.jsonl"
CLUSTER_NAME = "<your-cluster-name>"  # e.g. "my-cluster"
NAMESPACE = "<your-namespace>"  # e.g. "kubeflow"
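The values above are templates, and a job submitted with a leftover `<...>` placeholder will fail only after it reaches the cluster or S3. As a minimal sketch (the helper `find_unfilled` and the example dictionary are hypothetical, not part of the SDK), you could check for unfilled placeholders before creating a trainer:

```python
import re

# Matches template placeholders such as "<your-region>".
_PLACEHOLDER = re.compile(r"<[^<>]+>")


def find_unfilled(settings):
    """Return the sorted names of settings whose values still contain a <placeholder>."""
    return sorted(
        name for name, value in settings.items() if _PLACEHOLDER.search(str(value))
    )


# Example values for illustration only; two of them are deliberately left unfilled.
example_settings = {
    "REGION": "us-east-1",
    "S3_BUCKET": "<your-s3-bucket>",
    "CLUSTER_NAME": "my-cluster",
    "NAMESPACE": "<your-namespace>",
}

missing = find_unfilled(example_settings)
print("Unfilled settings:", missing)
```

In the notebook you would pass your own variables instead, for example `find_unfilled({"REGION": REGION, "S3_BUCKET": S3_BUCKET, "CLUSTER_NAME": CLUSTER_NAME, "NAMESPACE": NAMESPACE})`, and stop if the returned list is not empty.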

3. Create SFTTrainer with HyperPod Compute

Create a trainer with HyperPodCompute, which routes the job to your managed cluster.

from sagemaker.train import SFTTrainer
from sagemaker.train.common import TrainingType
from sagemaker.core.training.configs import HyperPodCompute

compute = HyperPodCompute(
    cluster_name=CLUSTER_NAME,
    namespace=NAMESPACE,
    instance_type="ml.p5.48xlarge",
    node_count=2,
)

sft_trainer = SFTTrainer(
    model="nova-textgeneration-micro",
    training_type=TrainingType.LORA,
    training_dataset=TRAINING_DATASET,
    s3_output_path=S3_OUTPUT_PATH,
    compute=compute,
    base_job_name="sft-hp",
)

4. Using Recipe Overrides

You can provide a YAML recipe file with your training configuration, then selectively override specific parameters through the overrides dict. Values in the overrides dict take precedence over values in the recipe file.

This is useful when you want a shared base recipe but need to experiment with specific hyperparameters (e.g., learning rate or number of epochs) without modifying the file.

Precedence order: overrides dict > recipe YAML > Hub/SDK defaults
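To make the precedence order concrete, here is a minimal sketch of a layered merge. It is not the SDK's implementation: `deep_merge` is a hypothetical helper, and the default values and the `warmup_steps` key are invented for illustration. Each later layer wins over the earlier one, and keys that a layer does not mention are kept.

```python
import copy


def deep_merge(base, override):
    """Return a new dict in which values from `override` win over `base`.

    Nested dicts are merged key by key; any other value is replaced outright.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative layers, lowest precedence first (values are assumptions).
example_defaults = {"training": {"learning_rate": 1e-4, "num_epochs": 1, "warmup_steps": 0}}
example_recipe = {"training": {"learning_rate": 1e-5, "num_epochs": 3, "batch_size": 8}}
example_overrides = {"training": {"learning_rate": 5e-6, "num_epochs": 5}}

merged_example = deep_merge(deep_merge(example_defaults, example_recipe), example_overrides)
print(merged_example)
```

In this sketch an override replaces a recipe value only when both use the same key path, so the resolved recipe is worth inspecting to confirm that an override landed where you expected.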

import yaml

# Create a custom recipe YAML
recipe_config = {
    "training": {
        "learning_rate": 1e-5,
        "num_epochs": 3,
        "batch_size": 8,
        "sequence_length": 2048,
    }
}

with open("my_sft_recipe_hp.yaml", "w") as f:
    yaml.dump(recipe_config, f)

print("Recipe file contents:")
print(yaml.dump(recipe_config, default_flow_style=False))

# Create trainer with recipe + overrides
# Here we override learning_rate (1e-5 -> 5e-6) and num_epochs (3 -> 5)
# from the recipe file above, while keeping batch_size and sequence_length unchanged
sft_trainer_with_recipe = SFTTrainer(
    model="nova-textgeneration-micro",
    training_type=TrainingType.LORA,
    training_dataset=TRAINING_DATASET,
    s3_output_path=S3_OUTPUT_PATH,
    compute=HyperPodCompute(
        cluster_name=CLUSTER_NAME,
        namespace=NAMESPACE,
        instance_type="ml.p5.48xlarge",
        node_count=1,
    ),
    recipe="my_sft_recipe_hp.yaml",
    overrides={"training_config": {"learning_rate": 5e-6, "num_epochs": 5}},
    base_job_name="sft-recipe-override-hp",
)

# Inspect the resolved recipe to confirm overrides were applied
resolved = sft_trainer_with_recipe.get_resolved_recipe()
print("Resolved recipe:")
print(json.dumps(resolved, indent=2))

5. Deep Nested Overrides

For fine-grained control on HyperPod, use dotted-path overrides. Each key names the full path to one nested recipe parameter, so you can change a single value without writing out the nested override dict.

# Deep nested override targets a specific path in the recipe hierarchy
sft_trainer_nested = SFTTrainer(
    model="amazon.nova-2-lite-v1",
    training_type=TrainingType.LORA,
    compute=HyperPodCompute(
        cluster_name=CLUSTER_NAME,
        namespace=NAMESPACE,
        instance_type="ml.p5.48xlarge",
        node_count=2,
    ),
    base_job_name="sft-nested-override-hp",
    training_dataset=TRAINING_DATASET,
    s3_output_path=S3_OUTPUT_PATH,
    # Dotted path reaches into: recipes -> training_config -> optim_config -> lr
    overrides={"recipes.training_config.optim_config.lr": 5e-6},
)

sft_trainer_nested.hyperparameters.max_steps = 10
sft_trainer_nested.hyperparameters.global_batch_size = 64
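The following sketch shows one way a dotted path can be read as a sequence of nested keys. It is an illustration of the idea only, not how the SDK resolves overrides: `set_by_path` is a hypothetical helper, and the example recipe, including its `weight_decay` entry, is invented.

```python
def set_by_path(config, dotted_path, value):
    """Set config[a][b][c] = value for dotted_path "a.b.c", creating missing levels.

    Mutates `config` in place and also returns it for convenience.
    """
    *parents, leaf = dotted_path.split(".")
    node = config
    for key in parents:
        # Descend one level, creating an empty dict if the level does not exist yet.
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


# Invented recipe fragment for illustration.
nested_example = {
    "recipes": {"training_config": {"optim_config": {"lr": 1e-5, "weight_decay": 0.01}}}
}

set_by_path(nested_example, "recipes.training_config.optim_config.lr", 5e-6)
print(nested_example["recipes"]["training_config"]["optim_config"])
```

Only the targeted leaf changes; sibling keys at every level are left as they were, which is the main convenience of a dotted path over a hand-written nested dict.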

6. Set Hyperparameters and Submit Training Job

# Limit training to 10 steps and use a global batch size of 64
sft_trainer.hyperparameters.max_steps = 10
sft_trainer.hyperparameters.global_batch_size = 64

# Submit (non-blocking)
sft_job = sft_trainer.train(wait=False)
print(f"HyperPod job submitted: {sft_job}")
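Because the job is submitted without blocking, you need some way to check on it afterwards. The sketch below is a generic polling loop and deliberately does not call any SDK or CLI method: `wait_for_terminal_state` is a hypothetical helper, `get_status` is a callable you would supply yourself, and the status strings are assumptions that must be replaced with whatever your status source reports.

```python
import time


def wait_for_terminal_state(
    get_status,
    terminal_states=("Completed", "Failed"),  # assumed names; adjust to your status source
    poll_seconds=30,
    max_polls=120,
):
    """Call get_status() until it returns a terminal state or max_polls is reached.

    Returns the terminal state, or None if polling gave up first.
    """
    for _ in range(max_polls):
        status = get_status()
        print(f"Job status: {status}")
        if status in terminal_states:
            return status
        time.sleep(poll_seconds)
    return None


# Stand-in status source so the sketch runs without a cluster.
fake_statuses = iter(["Pending", "Running", "Completed"])
final_status = wait_for_terminal_state(lambda: next(fake_statuses), poll_seconds=0)
print(f"Final status: {final_status}")
```

Bounding the loop with `max_polls` keeps a notebook cell from hanging indefinitely if the job never reaches a terminal state.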