Nova Data Mixing
Data mixing blends your custom training data with Nova’s curated synthetic datasets (code, math, chat, planning, instruction-following, reasoning, etc.) to prevent catastrophic forgetting while specializing the model on your domain.
Important: Data mixing is supported only with the serverless compute type. It is not available for serverful training jobs (SMTJ) or HyperPod clusters.
What you will learn
Configure DataMixingConfig with customer and Nova data percentages
Create an SFTTrainer with data mixing enabled
Set hyperparameters and submit a training job
Monitor job status
1. Setup
import json
import boto3
# === Fill in your AWS resources ===
REGION = "<your-region>" # e.g. "us-east-1"
ROLE_ARN = "<your-execution-role-arn>"
S3_BUCKET = "<your-s3-bucket>" # e.g. "sagemaker-us-east-1-123456789012"
S3_OUTPUT_PATH = f"s3://{S3_BUCKET}/sft-data-mixing/output"
TRAINING_DATASET = f"s3://{S3_BUCKET}/datasets/sft_training_data.jsonl"
2. Configure Data Mixing
Data mixing controls the blend between your custom training data and Nova’s internal
curated datasets. customer_data_percent sets the percentage of the training data that
comes from your dataset. The remainder is distributed among Nova categories according to
nova_data_percentages, whose values are relative shares of that remainder.
Available Nova categories include code, math, chat, planning,
instruction-following, reasoning, stem, rag, and factuality, among others.
from sagemaker.train.data_mixing_config import DataMixingConfig

# 70% of training data from your dataset, 30% from Nova curated data
# Within Nova data: 30% code, 70% math
data_mixing_config = DataMixingConfig(
    customer_data_percent=70.0,
    nova_data_percentages={
        "code": 30.0,
        "math": 70.0,
    },
)
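To make the arithmetic concrete, the sketch below computes each source's share of the overall training mix for the configuration above. It assumes, as the comments in the example indicate, that the Nova percentages are relative shares of whatever is left after customer_data_percent. The effective_mix function is a hypothetical helper for illustration only and is not part of the SageMaker SDK.

```python
def effective_mix(customer_data_percent, nova_data_percentages):
    """Return each source's share of the overall training mix, in percent.

    Hypothetical helper, not part of the SageMaker SDK. Assumes the Nova
    percentages are relative shares of the non-customer remainder.
    """
    nova_total = 100.0 - customer_data_percent
    mix = {"customer": customer_data_percent}
    for category, share in nova_data_percentages.items():
        # Scale the within-Nova share down to a share of the whole mix.
        mix[category] = nova_total * share / 100.0
    return mix


mix = effective_mix(70.0, {"code": 30.0, "math": 70.0})
for source, percent in mix.items():
    print(f"{source}: {percent:.1f}%")
# customer: 70.0%
# code: 9.0%
# math: 21.0%
```

Under that assumption, "30% code" means 30% of the Nova portion, which is 9% of all training data rather than 30% of it.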
3. Create SFTTrainer with Data Mixing
Pass the DataMixingConfig to SFTTrainer. Since data mixing only works with
serverless compute, no compute parameter is needed.
from sagemaker.train import SFTTrainer
from sagemaker.train.common import TrainingType

sft_trainer = SFTTrainer(
    model="amazon.nova-2-lite-v1",
    training_type=TrainingType.LORA,
    training_dataset=TRAINING_DATASET,
    s3_output_path=S3_OUTPUT_PATH,
    role=ROLE_ARN,
    data_mixing_config=data_mixing_config,
    base_job_name="sft-datamix",
)
4. Set Hyperparameters and Submit Training Job
# Set hyperparameters
sft_trainer.hyperparameters.max_steps = 50
sft_trainer.hyperparameters.learning_rate = 5e-6
sft_trainer.hyperparameters.global_batch_size = 32
# Submit (non-blocking)
training_job = sft_trainer.train(wait=False)
print(f"Training job submitted: {training_job}")
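As a rough sanity check on these values, the sketch below estimates how many training examples the run would consume and how many of them would come from your dataset. It rests on two assumptions that the source does not confirm: that global_batch_size counts examples per optimizer step, and that the mix percentages apply proportionally to sampled examples. The estimate_examples function is a hypothetical helper, not an SDK call.

```python
def estimate_examples(max_steps, global_batch_size, customer_data_percent):
    """Estimate (total, customer, nova) example counts for a run.

    Hypothetical helper. Assumes one optimizer step consumes
    global_batch_size examples and that the data mix is applied
    proportionally to the examples sampled.
    """
    total = max_steps * global_batch_size
    customer = total * customer_data_percent / 100.0
    return total, customer, total - customer


total, customer, nova = estimate_examples(50, 32, 70.0)
print(f"total={total}, customer={customer:.0f}, nova={nova:.0f}")
# total=1600, customer=1120, nova=480
```

If the estimated customer count is far below the size of your dataset, the run would see only part of it, so max_steps may be worth revisiting.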
5. Monitor Training Job
from sagemaker.core.resources import TrainingJob
job = TrainingJob.get(training_job_name=training_job.training_job_name)
print(f"Status: {job.training_job_status}")
print(f"Secondary Status: {job.secondary_status}")
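The snippet above reads the status once. Because the job was submitted with wait=False, you may want to poll until it finishes. The sketch below is one way to do that, assuming the job reports the standard SageMaker training job status values (InProgress, Completed, Failed, Stopping, Stopped). The wait_for_terminal_status function and its parameters are hypothetical. It takes any callable that returns the current status, so it can be exercised here with a fake status sequence.

```python
import time

# Standard SageMaker training job statuses that indicate the job has ended.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_terminal_status(get_status, poll_seconds=60.0, max_polls=120,
                             sleep=time.sleep):
    """Call get_status() until it returns a terminal status.

    Hypothetical helper. Raises TimeoutError if the job has not reached
    a terminal status after max_polls checks.
    """
    status = None
    for attempt in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"Job still {status!r} after {max_polls} polls")


# Demonstration with a fake status sequence instead of a real job.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses),
                                        poll_seconds=0.0)
print(f"Final status: {final_status}")
```

For a real job, the callable could wrap the lookup shown above, for example lambda: TrainingJob.get(training_job_name=training_job.training_job_name).training_job_status.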
6. Alternative: Different Data Mix Configurations
Here are some common configuration patterns depending on your use case.
# High specialization: mostly your data
high_specialization = DataMixingConfig(
    customer_data_percent=90.0,
    nova_data_percentages={
        "reasoning": 100.0,
    },
)

# Balanced: equal split with multiple Nova categories
balanced_mix = DataMixingConfig(
    customer_data_percent=50.0,
    nova_data_percentages={
        "code": 40.0,
        "reasoning": 30.0,
        "math": 30.0,
    },
)

# Light specialization: preserve broad capabilities
light_specialization = DataMixingConfig(
    customer_data_percent=30.0,
    nova_data_percentages={
        "code": 25.0,
        "math": 25.0,
        "chat": 25.0,
        "reasoning": 25.0,
    },
)
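When writing your own variations, a quick local check can catch arithmetic slips before a job is submitted. The sketch below works on plain numbers and dictionaries, so it does not need the SDK. It assumes that the Nova percentages should sum to 100, which holds for every example on this page but is not stated as a rule. The check_mix function is a hypothetical helper.

```python
def check_mix(customer_data_percent, nova_data_percentages, tolerance=1e-6):
    """Return a list of problems; an empty list means the mix looks consistent.

    Hypothetical helper. Assumes Nova percentages are non-negative and
    sum to 100, as in the examples on this page.
    """
    problems = []
    if not 0.0 <= customer_data_percent <= 100.0:
        problems.append("customer_data_percent must be between 0 and 100")
    for category, share in nova_data_percentages.items():
        if share < 0.0:
            problems.append(f"negative share for category {category!r}")
    nova_sum = sum(nova_data_percentages.values())
    if abs(nova_sum - 100.0) > tolerance:
        problems.append(f"Nova percentages sum to {nova_sum}, expected 100")
    return problems


# The three patterns above, expressed as plain values.
presets = {
    "high_specialization": (90.0, {"reasoning": 100.0}),
    "balanced_mix": (50.0, {"code": 40.0, "reasoning": 30.0, "math": 30.0}),
    "light_specialization": (
        30.0,
        {"code": 25.0, "math": 25.0, "chat": 25.0, "reasoning": 25.0},
    ),
}
results = {name: check_mix(pct, nova) for name, (pct, nova) in presets.items()}
for name, problems in results.items():
    print(name, "OK" if not problems else problems)
```

The check does not reject unfamiliar category names, because the list of Nova categories given earlier is not exhaustive.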
Tips
High customer_data_percent (80–90%): Use when your task is well-defined and you have enough training data of your own.
Balanced (50–70%): Good default for most use cases.
Low customer_data_percent (20–40%): Preserve base model capabilities with light specialization.
Nova category selection: Choose categories that complement your task (e.g., code and reasoning for a coding assistant).