Fine-Tuning with Serverful Training Jobs (SMTJ)

This notebook demonstrates how to fine-tune Amazon Nova models with SFTTrainer on serverful SageMaker training jobs, which run on dedicated instances, and how to apply recipe overrides.

What you will learn

Create an SFTTrainer with TrainingJobCompute
Use YAML recipe files with selective overrides
Inspect the resolved recipe before submission
Submit and monitor a training job
Use the low-level ModelTrainer.from_recipe() API

1. Setup

import json
import boto3

# === Fill in your AWS resources ===
REGION = "<your-region>"  # e.g. "us-east-1"
ROLE_ARN = "<your-execution-role-arn>"
S3_BUCKET = "<your-s3-bucket>"  # e.g. "sagemaker-us-east-1-123456789012"

S3_OUTPUT_PATH = f"s3://{S3_BUCKET}/sft-smtj/output"
TRAINING_DATASET = f"s3://{S3_BUCKET}/datasets/sft_training_data.jsonl"
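
Before uploading the training file, it can help to confirm that every line parses as a standalone JSON object, since a single malformed line is a common cause of early job failures. The following is a minimal sketch under that assumption only: `check_jsonl` is a hypothetical helper, not part of the SageMaker SDK, and it does not validate the record schema that Nova fine-tuning expects.

```python
import json


def check_jsonl(lines):
    """Return (line_number, problem) pairs for lines that are not JSON objects."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
    return problems


# Illustrative input: one valid record, one broken line, one non-object value.
sample = ['{"a": 1}', "not json", "[1, 2]"]
for number, problem in check_jsonl(sample):
    print(f"line {number}: {problem}")
```

To check a local copy of the dataset, pass an open file handle, for example `check_jsonl(open_file)` inside a `with open(...)` block; the function only needs an iterable of strings.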

2. Create SFTTrainer

Create a trainer with serverful compute (TrainingJobCompute). This provisions dedicated instances for the training job.

from sagemaker.train import SFTTrainer
from sagemaker.train.common import TrainingType
from sagemaker.core.training.configs import TrainingJobCompute

sft_trainer = SFTTrainer(
    model="amazon.nova-2-lite-v1",
    training_type=TrainingType.LORA,
    training_dataset=TRAINING_DATASET,
    s3_output_path=S3_OUTPUT_PATH,
    compute=TrainingJobCompute(
        instance_type="ml.p5.48xlarge",
        instance_count=2,
    ),
    role=ROLE_ARN,
    base_job_name="sft-smtj",
)

3. Using Recipe Overrides

You can provide a YAML recipe file with your training configuration, then override specific parameters through the overrides dict. Values in overrides take precedence over the corresponding values in the recipe file.

This is useful when you want a shared base recipe but need to experiment with specific hyperparameters (e.g., learning rate or number of epochs) without modifying the file.

Precedence order: overrides dict > recipe YAML > Hub/SDK defaults
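
The precedence order can be pictured as a layered merge of nested dictionaries, where each later layer wins on conflicting keys. The sketch below only illustrates the idea: `deep_merge` is a hypothetical helper, the default values are made up, and the SDK's actual resolution logic is not shown in this notebook and may differ.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where values in `override` win over values in `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, or mismatched types: the override replaces the base.
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative layers (values are assumptions, not real Hub/SDK defaults).
defaults = {"training_config": {"learning_rate": 1e-4, "num_epochs": 1, "batch_size": 8}}
recipe = {"training_config": {"learning_rate": 1e-5, "num_epochs": 3}}
overrides = {"training_config": {"learning_rate": 5e-6}}

# Apply lowest precedence first, highest precedence last.
resolved = deep_merge(deep_merge(defaults, recipe), overrides)
print(resolved)
```

In this example learning_rate comes from the overrides, num_epochs from the recipe, and batch_size from the defaults.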

import yaml

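# Write a custom recipe YAML.
# The section key matches the one used in `overrides` below ("training_config"),
# so the recipe and the overrides address the same section.
recipe_config = {
    "training_config": {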
        "learning_rate": 1e-5,
        "num_epochs": 3,
        "batch_size": 8,
        "sequence_length": 2048,
    }
}

with open("my_sft_recipe.yaml", "w") as f:
    yaml.dump(recipe_config, f)

print("Recipe file contents:")
print(yaml.dump(recipe_config, default_flow_style=False))

# Create trainer with recipe + overrides
# Here we override learning_rate (1e-5 -> 5e-6) and num_epochs (3 -> 5)
# from the recipe file above, while keeping batch_size and sequence_length unchanged
sft_trainer_with_recipe = SFTTrainer(
    model="amazon.nova-2-lite-v1",
    training_type=TrainingType.LORA,
    training_dataset=TRAINING_DATASET,
    s3_output_path=S3_OUTPUT_PATH,
    compute=TrainingJobCompute(
        instance_type="ml.p5.48xlarge",
        instance_count=2,
    ),
    role=ROLE_ARN,
    recipe="my_sft_recipe.yaml",
    overrides={"training_config": {"learning_rate": 5e-6, "num_epochs": 5}},
    base_job_name="sft-recipe-override-smtj",
)

# Inspect the resolved recipe to confirm overrides were applied
resolved = sft_trainer_with_recipe.get_resolved_recipe()
print("Resolved training_config:")
print(json.dumps(resolved.get("training_config", resolved), indent=2))
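
Reading the printed recipe by eye works for a handful of values; for larger recipes a programmatic check is less error-prone. The sketch below assumes the resolved recipe is a plain nested dict, as the json.dumps call above suggests. `find_unapplied` is a hypothetical helper, not an SDK function.

```python
_MISSING = object()  # sentinel so a missing key differs from an explicit None


def find_unapplied(overrides, resolved, path=""):
    """Return dotted paths of override values that do not appear in `resolved`."""
    unapplied = []
    for key, expected in overrides.items():
        here = f"{path}.{key}" if path else key
        actual = resolved.get(key, _MISSING) if isinstance(resolved, dict) else _MISSING
        if isinstance(expected, dict):
            # Recurse into nested sections; a missing section yields all its leaves.
            unapplied.extend(find_unapplied(expected, actual, here))
        elif actual is _MISSING or actual != expected:
            unapplied.append(here)
    return unapplied


# Illustrative data: num_epochs was not picked up in this made-up resolved recipe.
requested = {"training_config": {"learning_rate": 5e-6, "num_epochs": 5}}
example_resolved = {"training_config": {"learning_rate": 5e-6, "num_epochs": 3}}
print(find_unapplied(requested, example_resolved))
```

An empty list means every requested override is present in the resolved recipe with the requested value.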

4. Set Hyperparameters and Submit Training Job

sft_trainer.hyperparameters.max_steps = 50
sft_trainer.hyperparameters.learning_rate = 5e-6
sft_trainer.hyperparameters.global_batch_size = 32

# Submit (non-blocking)
training_job = sft_trainer.train(wait=False)
print(f"Training job submitted: {training_job}")

5. Monitor Training Job

from sagemaker.core.resources import TrainingJob

job = TrainingJob.get(training_job_name=training_job.training_job_name)
print(f"Status: {job.training_job_status}")
print(f"Secondary Status: {job.secondary_status}")
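
The cell above reads the status once. Because the job was submitted with wait=False, a simple polling loop is one way to block until it finishes. This is a minimal sketch: `wait_for_job` is a hypothetical helper, the polling interval and timeout are arbitrary choices, and the terminal values are the Completed, Failed, and Stopped states that SageMaker reports for training jobs.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(get_status, poll_seconds=60, timeout_seconds=6 * 3600, sleep=time.sleep):
    """Call `get_status()` until it returns a terminal status, then return it.

    `get_status` is any zero-argument callable returning the current status
    string; `sleep` is injectable so the loop can be exercised without waiting.
    """
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a fake status source instead of a real training job.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
print(wait_for_job(lambda: next(fake_statuses), poll_seconds=0, sleep=lambda s: None))
```

With the objects from this notebook, the status callable could be `lambda: TrainingJob.get(training_job_name=training_job.training_job_name).training_job_status`, which re-fetches the job on every poll.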

6. Low-Level Alternative: ModelTrainer.from_recipe

For maximum control, use ModelTrainer.from_recipe() directly. This gives you full access to the recipe structure without the high-level trainer abstraction.

import yaml
from sagemaker.train import ModelTrainer
from sagemaker.train.configs import Compute

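# Recipe with a run section and a training_config section.
nova_recipe = {
    "run": {
        "model_type": "amazon.nova.lite",
        "model_name_or_path": "nova-textgeneration-lite-v2",
        "replicas": 1,
    },
    "training_config": {
        "learning_rate": 1e-5,
        "num_epochs": 3,
        "batch_size": 4,
    },
}

# Write the recipe to disk; the context manager closes the file handle.
with open("my_nova_recipe.yaml", "w") as f:
    yaml.dump(nova_recipe, f)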

model_trainer = ModelTrainer.from_recipe(
    training_recipe="my_nova_recipe.yaml",
    compute=Compute(instance_type="ml.p5.48xlarge", instance_count=1),
    training_image="<your-nova-training-image-uri>",
    recipe_overrides={"training_config": {"learning_rate": 5e-6, "num_epochs": 5}},
)

resolved = model_trainer.get_resolved_recipe()
print(json.dumps(resolved.get("training_config", resolved), indent=2))