Nova Data Mixing


Data mixing blends your custom training data with Nova’s curated synthetic datasets (such as code, math, chat, planning, instruction-following, and reasoning) so that the model specializes on your domain while limiting catastrophic forgetting of its general capabilities.

Important: Data mixing is supported only with the serverless compute type. It is not available for serverful training jobs (SMTJ) or HyperPod clusters.

What you will learn

Configure DataMixingConfig with customer and Nova data percentages
Create an SFTTrainer with data mixing enabled
Set hyperparameters and submit a training job
Monitor job status

1. Setup

import json
import boto3

# === Fill in your AWS resources ===
REGION = "<your-region>"  # e.g. "us-east-1"
ROLE_ARN = "<your-execution-role-arn>"
S3_BUCKET = "<your-s3-bucket>"  # e.g. "sagemaker-us-east-1-123456789012"

S3_OUTPUT_PATH = f"s3://{S3_BUCKET}/sft-data-mixing/output"
TRAINING_DATASET = f"s3://{S3_BUCKET}/datasets/sft_training_data.jsonl"

2. Configure Data Mixing

Data mixing controls the blend between your custom training data and Nova’s internal curated datasets. customer_data_percent sets the percentage of the training data that comes from your dataset. The remainder (100 minus that value) comes from Nova data and is split among Nova categories according to nova_data_percentages. Those values are shares of the Nova portion, not of the total, and they sum to 100 in every example on this page.

Available Nova categories include code, math, chat, planning, instruction-following, reasoning, stem, rag, and factuality, among others.

from sagemaker.train.data_mixing_config import DataMixingConfig

# 70% of training data from your dataset, 30% from Nova curated data
# Within Nova data: 30% code, 70% math
data_mixing_config = DataMixingConfig(
    customer_data_percent=70.0,
    nova_data_percentages={
        "code": 30.0,
        "math": 70.0,
    },
)
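
To see what this configuration means for the overall blend, the arithmetic can be worked out directly. The sketch below is plain Python and does not call the SageMaker SDK. effective_mix is a hypothetical helper written for this illustration, and it assumes, as the comment in the example above suggests, that nova_data_percentages are shares of the Nova portion and sum to 100. Whether the service enforces that sum is not stated here.

```python
import math


def effective_mix(customer_data_percent, nova_data_percentages):
    """Return each source's share of the total training data, in percent.

    Hypothetical helper for illustration only; not part of the SageMaker SDK.
    Assumes nova_data_percentages are shares within the Nova portion.
    """
    if not 0.0 <= customer_data_percent <= 100.0:
        raise ValueError("customer_data_percent must be between 0 and 100")

    nova_total = sum(nova_data_percentages.values())
    if not math.isclose(nova_total, 100.0, abs_tol=1e-6):
        raise ValueError(f"Nova percentages sum to {nova_total}, expected 100")

    # Whatever is not customer data is the Nova portion of the blend.
    nova_share = 100.0 - customer_data_percent
    mix = {"customer": customer_data_percent}
    for category, percent in nova_data_percentages.items():
        mix[category] = nova_share * percent / 100.0
    return mix


mix = effective_mix(70.0, {"code": 30.0, "math": 70.0})
print(mix)  # {'customer': 70.0, 'code': 9.0, 'math': 21.0}
```

Under these assumptions, the 70/30 configuration above works out to 70% customer data, 9% Nova code data, and 21% Nova math data.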

3. Create SFTTrainer with Data Mixing

Pass the DataMixingConfig to SFTTrainer. Because data mixing works only with serverless compute, you do not pass a compute parameter.

from sagemaker.train import SFTTrainer
from sagemaker.train.common import TrainingType

sft_trainer = SFTTrainer(
    model="amazon.nova-2-lite-v1",
    training_type=TrainingType.LORA,
    training_dataset=TRAINING_DATASET,
    s3_output_path=S3_OUTPUT_PATH,
    role=ROLE_ARN,
    data_mixing_config=data_mixing_config,
    base_job_name="sft-datamix",
)

4. Set Hyperparameters and Submit Training Job

# Set hyperparameters
sft_trainer.hyperparameters.max_steps = 50
sft_trainer.hyperparameters.learning_rate = 5e-6
sft_trainer.hyperparameters.global_batch_size = 32

# Submit (non-blocking)
training_job = sft_trainer.train(wait=False)
print(f"Training job submitted: {training_job}")
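
As a rough sanity check on these hyperparameters, it can help to estimate how many training samples the run will consume and how many of them come from your dataset. The sketch below is plain Python with a hypothetical sample_budget helper. It assumes that each step consumes one global batch and that the mix ratio applies proportionally at the sample level, neither of which is confirmed by this page.

```python
def sample_budget(max_steps, global_batch_size, customer_data_percent):
    """Estimate samples seen in a run, split by source.

    Hypothetical helper for illustration only. Assumes one global batch
    per step and a mix ratio applied proportionally to samples.
    """
    total = max_steps * global_batch_size
    customer = total * customer_data_percent / 100.0
    return {"total": total, "customer": customer, "nova": total - customer}


budget = sample_budget(max_steps=50, global_batch_size=32, customer_data_percent=70.0)
print(budget)  # {'total': 1600, 'customer': 1120.0, 'nova': 480.0}
```

If the estimated customer sample count is much larger than your dataset, the run would revisit the same examples several times under these assumptions, which may be worth considering when choosing max_steps.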

5. Monitor Training Job

from sagemaker.core.resources import TrainingJob

job = TrainingJob.get(training_job_name=training_job.training_job_name)
print(f"Status: {job.training_job_status}")
print(f"Secondary Status: {job.secondary_status}")
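
The status check above is a single snapshot. To wait for the job to finish, one option is a small polling loop, sketched below. wait_for_terminal_status is a hypothetical helper, not an SDK function, and it takes any callable that returns the current status string, so the example runs here with a stand-in status source instead of a real job. The terminal values are the standard SageMaker training job statuses Completed, Failed, and Stopped.

```python
import time

# Statuses after which a SageMaker training job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_terminal_status(get_status, poll_seconds=30.0,
                             timeout_seconds=3600.0, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or time runs out.

    Hypothetical helper for illustration only; not part of the SageMaker SDK.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job still {status!r} after {timeout_seconds} seconds"
            )
        sleep(poll_seconds)


# Stand-in status source so the sketch runs without AWS access.
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(
    lambda: next(_fake_statuses), poll_seconds=0.0
)
print(final_status)  # Completed
```

With a real job, get_status could be a function that repeats the call shown above and returns TrainingJob.get(training_job_name=training_job.training_job_name).training_job_status.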

6. Alternative: Different Data Mix Configurations

Here are some common configuration patterns depending on your use case.

# High specialization: mostly your data
high_specialization = DataMixingConfig(
    customer_data_percent=90.0,
    nova_data_percentages={
        "reasoning": 100.0,
    },
)

# Balanced: 50/50 split between your data and Nova data, spread across several Nova categories
balanced_mix = DataMixingConfig(
    customer_data_percent=50.0,
    nova_data_percentages={
        "code": 40.0,
        "reasoning": 30.0,
        "math": 30.0,
    },
)

# Light specialization: preserve broad capabilities
light_specialization = DataMixingConfig(
    customer_data_percent=30.0,
    nova_data_percentages={
        "code": 25.0,
        "math": 25.0,
        "chat": 25.0,
        "reasoning": 25.0,
    },
)

Tips

High customer_data_percent (80–90%): use when your task is well-defined and you have enough data.
Balanced customer_data_percent (50–70%): a good default for most use cases.
Low customer_data_percent (20–40%): preserves base model capabilities with light specialization.
Nova category selection: choose categories that complement your task (e.g., code + reasoning for a coding assistant).