Nova Data Mixing

Nova Data Mixing#

Data mixing blends your custom training data with Nova’s curated synthetic datasets (code, math, chat, planning, instruction-following, reasoning, etc.) to prevent catastrophic forgetting while specializing the model on your domain.

Important

Data mixing is only supported with serverless compute type. It is not available for serverful training jobs (SMTJ) or HyperPod clusters.

How Data Mixing Works#

During fine-tuning, data mixing controls the proportion of training samples that come from your custom dataset versus Nova’s internal curated datasets. By mixing in curated data, the base model retains broad capabilities (e.g., code generation, mathematical reasoning) while adapting to your specific task.

Your data (customer_data_percent%) + Nova curated data (remaining%)
                                         ├── code
                                         ├── math
                                         ├── chat
                                         ├── planning
                                         ├── instruction-following
                                         ├── reasoning
                                         ├── stem
                                         ├── rag
                                         └── factuality

Configuration#

Use DataMixingConfig to specify the blend:

from sagemaker.train.data_mixing_config import DataMixingConfig

data_mixing_config = DataMixingConfig(
    customer_data_percent=70.0,
    nova_data_percentages={
        "code": 30.0,
        "math": 70.0,
    },
)

Parameters:

customer_data_percent — percentage of training data that comes from your dataset (0–100)
nova_data_percentages — distribution of Nova’s curated datasets for the remaining portion. The values represent the relative weight among the selected Nova categories. Available categories include: code, math, chat, planning, instruction-following, reasoning, stem, rag, factuality, etc.

Note

The values in nova_data_percentages must sum to 100.

Example: SFT with Data Mixing (Serverless)#

import json
import boto3

REGION = "us-east-1"
ROLE_ARN = "arn:aws:iam::123456789012:role/MySageMakerRole"
S3_BUCKET = "sagemaker-us-east-1-123456789012"
S3_OUTPUT_PATH = f"s3://{S3_BUCKET}/sft-data-mixing/output"
TRAINING_DATASET = f"s3://{S3_BUCKET}/datasets/sft_training_data.jsonl"

from sagemaker.train import SFTTrainer
from sagemaker.train.common import TrainingType
from sagemaker.train.data_mixing_config import DataMixingConfig

data_mixing_config = DataMixingConfig(
    customer_data_percent=70.0,
    nova_data_percentages={
        "code": 30.0,
        "math": 70.0,
    },
)

sft_trainer = SFTTrainer(
    model="amazon.nova-2-lite-v1",
    training_type=TrainingType.LORA,
    training_dataset=TRAINING_DATASET,
    s3_output_path=S3_OUTPUT_PATH,
    role=ROLE_ARN,
    data_mixing_config=data_mixing_config,
    base_job_name="sft-datamix",
)

# Set hyperparameters
sft_trainer.hyperparameters.max_steps = 50
sft_trainer.hyperparameters.learning_rate = 5e-6
sft_trainer.hyperparameters.global_batch_size = 32

# Submit (non-blocking)
training_job = sft_trainer.train(wait=False)
print(f"Training job submitted: {training_job}")

Tips for Choosing Data Mix Percentages#

High customer_data_percent (80–90%) — Use when your domain-specific task is well-defined and you have enough training data. The model will specialize quickly but may lose some general capabilities.
Balanced customer_data_percent (50–70%) — Good default for most use cases. Preserves broad capabilities while still adapting to your domain.
Low customer_data_percent (20–40%) — Use when you want to mostly preserve the base model’s capabilities with light specialization, or when your training dataset is small.
Nova category selection — Choose categories that complement your task. For example, if fine-tuning for a coding assistant, include code and reasoning in the Nova mix to reinforce those capabilities.

Interactive Notebook#

For a complete walkthrough, see the Data Mixing notebook.