SageMaker LLM-as-Judge Evaluation#

This notebook demonstrates LLM-as-Judge evaluation using LLMAsJudgeEvaluator.

Part 1: Basic Usage — Create and manage evaluation jobs with custom metrics.

Part 2: Custom Models (Nova only) — Evaluate fine-tuned Nova models via Model Package ARN using the InspectAI-based inference path.

# Configure AWS credentials and region
#! ada credentials update --provider=isengard --account=<> --role=Admin --profile=default --once
#! aws configure set region us-west-2

Configuration#

# Configuration
REGION = 'us-west-2'
S3_BUCKET = 's3://mufi-test-serverless-smtj/eval/'
# DATASET = 'arn:aws:sagemaker:us-west-2:<>:hub-content/AIRegistry/DataSet/gen-qa-test-content/1.0.1'  # Dataset ARN or S3 URI
DATASET = "s3://my-sagemaker-sherpa-dataset/dataset/gen-qa-formatted-dataset/gen_qa.jsonl"
MLFLOW_ARN = 'arn:aws:sagemaker:us-west-2:<>:mlflow-tracking-server/mmlu-eval-experiment'

Step 1: Import Required Libraries#

Import the LLMAsJudgeEvaluator class.

import json
from sagemaker.train.evaluate import LLMAsJudgeEvaluator
from rich.pretty import pprint

# Configure logging to show INFO messages
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(name)s - %(message)s'
)

Step 2: Create LLMAsJudgeEvaluator#

Create an LLMAsJudgeEvaluator instance with the desired evaluator model, dataset, and metrics.

Key Parameters:#

  • model: Model Package (or Base Model) to be evaluated (required)

  • evaluator_model: Bedrock model ID to use as judge (required)

  • dataset: S3 URI or Dataset ARN (required)

  • builtin_metrics: List of built-in metrics (optional, no ‘Builtin.’ prefix needed)

  • custom_metrics: JSON string of custom metrics (optional)

  • evaluate_base_model: Whether to evaluate base model in addition to custom model (optional, default=True)

  • mlflow_resource_arn: MLflow tracking server ARN (optional)

  • model_package_group: Model package group ARN (optional)

  • s3_output_path: S3 output location (required)

A. Using custom metrics (as JSON string)#

Custom metrics must be provided as a properly escaped JSON string. You can either:

  1. Create a Python dict and use json.dumps() to convert it

  2. Provide a pre-escaped JSON string directly

# Method 1: Create dict and convert to JSON string
custom_metric_dict = {
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": (
            "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
            "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
            "Consider the following:\n"
            "- Does the response have a positive, encouraging tone?\n"
            "- Is the response helpful and constructive?\n"
            "- Does it avoid negative language or criticism?\n\n"
            "Rate on this scale:\n"
            "- Good: Response has positive sentiment\n"
            "- Poor: Response lacks positive sentiment\n\n"
            "Here is the actual task:\n"
            "Prompt: {{prompt}}\n"
            "Response: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}

# Convert to JSON string
custom_metrics_json = json.dumps([custom_metric_dict])  # Note: wrap in list

# Create evaluator with custom metrics
evaluator = LLMAsJudgeEvaluator(
    # model='arn:aws:sagemaker:us-west-2:<>:model-package/Demo-test-deb-2/1',  # Alternative Model Package ARN
    model="arn:aws:sagemaker:us-west-2:<>:model-package/test-finetuned-models-gamma/28",  # Required: Model Package ARN or base model
    evaluator_model="anthropic.claude-3-5-haiku-20241022-v1:0",  # Required
    dataset=DATASET,  # Required: S3 URI or Dataset ARN
    builtin_metrics=["Completeness", "Faithfulness"],  # Optional: Can combine with custom metrics
    custom_metrics=custom_metrics_json,  # Optional: JSON string of custom metrics
    mlflow_resource_arn=MLFLOW_ARN,  # Optional
    # model_package_group=MODEL_PACKAGE_GROUP,  # Optional if model is a Model Package ARN/object
    s3_output_path=S3_BUCKET,  # Required
    evaluate_base_model=False,  # Optional: skip base model evaluation (default=True)
)

pprint(evaluator)
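Because custom metrics are passed as a raw JSON string, a malformed definition may only surface once the job is running. The following is a minimal sketch of a local pre-flight check. `validate_custom_metrics` is a hypothetical helper, not part of the SDK, and the rules it enforces (the three keys, a `{{prediction}}` placeholder, a `floatValue` per rating) are assumptions inferred from the example above rather than a documented schema.

```python
import json
import re

# Keys that appear in the example metric above; treating them as required is an assumption.
REQUIRED_KEYS = ("name", "instructions", "ratingScale")
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def validate_custom_metrics(custom_metrics_json: str) -> list[str]:
    """Return a list of problems found in a custom-metrics JSON string (hypothetical helper)."""
    try:
        metrics = json.loads(custom_metrics_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(metrics, list):
        return ["top-level value must be a list of metric definitions"]

    problems = []
    for i, metric in enumerate(metrics):
        definition = metric.get("customMetricDefinition") if isinstance(metric, dict) else None
        if not isinstance(definition, dict):
            problems.append(f"metric {i}: missing 'customMetricDefinition' object")
            continue
        for key in REQUIRED_KEYS:
            if key not in definition:
                problems.append(f"metric {i}: missing '{key}'")
        # The judge needs to see the model response, so look for the {{prediction}} placeholder.
        placeholders = set(PLACEHOLDER.findall(definition.get("instructions", "")))
        if "prediction" not in placeholders:
            problems.append(f"metric {i}: instructions never reference {{{{prediction}}}}")
        for entry in definition.get("ratingScale", []):
            if "floatValue" not in entry.get("value", {}):
                problems.append(f"metric {i}: rating '{entry.get('definition')}' has no floatValue")
    return problems
```

Calling `validate_custom_metrics(custom_metrics_json)` before constructing the evaluator returns an empty list when none of these checks fail.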

[Optional] Example with multiple custom metrics#

# # Create multiple custom metrics
# custom_metrics_list = [
#     {
#         "customMetricDefinition": {
#             "name": "GoodMetric",
#             "instructions": (
#                 "Assess if the response has positive sentiment. "
#                 "Prompt: {{prompt}}\nResponse: {{prediction}}"
#             ),
#             "ratingScale": [
#                 {"definition": "Good", "value": {"floatValue": 1}},
#                 {"definition": "Poor", "value": {"floatValue": 0}}
#             ]
#         }
#     },
#     {
#         "customMetricDefinition": {
#             "name": "BadMetric",
#             "instructions": (
#                 "Assess if the response has negative sentiment. "
#                 "Prompt: {{prompt}}\nResponse: {{prediction}}"
#             ),
#             "ratingScale": [
#                 {"definition": "Bad", "value": {"floatValue": 1}},
#                 {"definition": "Good", "value": {"floatValue": 0}}
#             ]
#         }
#     }
# ]

# # Convert list to JSON string
# custom_metrics_json = json.dumps(custom_metrics_list)

# # Create evaluator
# evaluator = LLMAsJudgeEvaluator(
#     model=BASE_MODEL,  # placeholder: Model Package ARN or base model
#     evaluator_model="anthropic.claude-3-5-haiku-20241022-v1:0",
#     dataset=DATASET,
#     custom_metrics=custom_metrics_json,  # Multiple custom metrics
#     s3_output_path=S3_BUCKET,
# )

# print(f"✅ Created evaluator with {len(json.loads(custom_metrics_json))} custom metrics")
# pprint(evaluator)

[Optional] Skipping base model evaluation (evaluate custom model only)#

By default, LLM-as-Judge evaluates both the base model and custom model. You can skip base model evaluation to save time and cost by setting evaluate_base_model=False.

# # Define custom metrics as a pre-escaped JSON string (Method 2)
# custom_metrics = "[{\"customMetricDefinition\":{\"name\":\"GoodMetric\",\"instructions\":\"You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\\n\\nConsider the following:\\n- Does the response have a positive, encouraging tone?\\n- Is the response helpful and constructive?\\n- Does it avoid negative language or criticism?\\n\\nRate on this scale:\\n- Good: Response has positive sentiment\\n- Poor: Response lacks positive sentiment\\n\\nHere is the actual task:\\nPrompt: {{prompt}}\\nResponse: {{prediction}}\",\"ratingScale\":[{\"definition\":\"Good\",\"value\":{\"floatValue\":1}},{\"definition\":\"Poor\",\"value\":{\"floatValue\":0}}]}},{\"customMetricDefinition\":{\"name\":\"BadMetric\",\"instructions\":\"You are an expert evaluator. Your task is to assess if the sentiment of the response is negative. Rate the response based on whether it conveys negative sentiment, unhelpfulness, or destructive tone.\\n\\nConsider the following:\\n- Does the response have a negative, discouraging tone?\\n- Is the response unhelpful or destructive?\\n- Does it use negative language or harsh criticism?\\n\\nRate on this scale:\\n- Bad: Response has negative sentiment\\n- Good: Response lacks negative sentiment\\n\\nHere is the actual task:\\nPrompt: {{prompt}}\\nResponse: {{prediction}}\",\"ratingScale\":[{\"definition\":\"Bad\",\"value\":{\"floatValue\":1}},{\"definition\":\"Good\",\"value\":{\"floatValue\":0}}]}}]"

# # Create evaluator that only evaluates the custom model
# evaluator = LLMAsJudgeEvaluator(
#     model=BASE_MODEL,  # placeholder: Model Package ARN or base model
#     evaluator_model="anthropic.claude-3-5-haiku-20241022-v1:0",
#     dataset=DATASET,
#     builtin_metrics=["Completeness", "Faithfulness", "Helpfulness"],
#     custom_metrics=custom_metrics,
#     mlflow_resource_arn=MLFLOW_ARN,
#     model_package_group=MODEL_PACKAGE_GROUP,
#     model_artifact=MODEL_ARTIFACT,
#     s3_output_path=S3_BUCKET,
#     evaluate_base_model=False,  # KEY: Skip base model evaluation
# )

# print("✅ Created evaluator (custom model only)")
# pprint(evaluator)

Step 3: Run LLM-as-Judge Evaluation#

Start the evaluation job. The evaluator will:

  1. Generate inference responses from the base model (if evaluate_base_model=True)

  2. Generate inference responses from the custom model

  3. Use the judge model to evaluate responses with built-in and custom metrics

# Run evaluation
execution = evaluator.evaluate()

print("✅ Evaluation job started!")
print(f"Job ARN: {execution.arn}")
print(f"Job Name: {execution.name}")
print(f"Status: {execution.status.overall_status}")

pprint(execution)

Step 4: Check Job Status#

Refresh and display the current job status with step details.

# Refresh status
execution.refresh()

# Display job status using rich pprint
pprint(execution.status)

Step 5: Monitor Pipeline Execution#

Poll the pipeline status until it reaches a terminal state (Succeeded, Failed, or Stopped).

# Wait for job completion (optional)
# This will poll every 5 seconds for up to 1 hour
execution.wait(poll=5, timeout=3600)

# Display results
execution.show_results(limit=10, offset=0, show_explanations=False)
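To make the polling behaviour concrete, here is a minimal sketch of a poll-until-terminal loop. It is not the SDK's implementation of `wait()`. `wait_for_terminal_state` and its `get_status` callable are hypothetical, and only the terminal state names (Succeeded, Failed, Stopped) come from the text above.

```python
import time
from typing import Callable

# Terminal states named in this notebook.
TERMINAL_STATES = {"Succeeded", "Failed", "Stopped"}


def wait_for_terminal_state(
    get_status: Callable[[], str],
    poll: float = 5,
    timeout: float = 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() every `poll` seconds until it returns a terminal state.

    Raises TimeoutError if the next poll would fall after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() + poll > deadline:
            raise TimeoutError(f"still '{status}' after {timeout} seconds")
        sleep(poll)
```

With an execution object, `get_status` could be a small function that calls `execution.refresh()` and returns `execution.status.overall_status`. A longer `poll` interval reduces API calls for long-running jobs.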

Retrieve an Existing Job#

You can retrieve and inspect any existing evaluation job using its ARN.

# Get an existing job by ARN
# Replace with your actual pipeline execution ARN
existing_arn = 'arn:aws:sagemaker:us-west-2:<>:pipeline/SagemakerEvaluation-llmasjudge/execution/4hr7446yft1d'

from sagemaker.train.evaluate import EvaluationPipelineExecution
from rich.pretty import pprint

existing_execution = EvaluationPipelineExecution.get(
    arn=existing_arn,
    region="us-west-2"
)
pprint(existing_execution.status)

existing_execution.show_results(limit=5, offset=0, show_explanations=False)

Get All LLM-as-Judge Evaluations#

Retrieve all LLM-as-Judge evaluation jobs.

from sagemaker.train.evaluate import LLMAsJudgeEvaluator

# get_all() returns an iterator; collect the results into a list
all_executions = list(LLMAsJudgeEvaluator.get_all(region="us-west-2"))

print(f"Found {len(all_executions)} LLM-as-Judge evaluation jobs")
# Use a distinct loop variable so the `execution` object from Step 3 is not overwritten
for exec_item in all_executions:
    print(f"  - {exec_item.name}: {exec_item.status.overall_status}")

Stop a Running Job (Optional)#

If needed, you can stop a running evaluation job.

# Uncomment to stop the job
# execution.stop()
# print(f"Execution stopped. Status: {execution.status.overall_status}")

Dataset Support#

The dataset parameter supports two formats:

1. S3 URI#

dataset="s3://my-bucket/path/to/dataset.jsonl"

2. Dataset ARN (AI Registry)#

dataset="arn:aws:sagemaker:us-west-2:123456789012:hub-content/AIRegistry/DataSet/my-dataset/1.0.0"

The evaluator automatically detects which format is provided and uses the appropriate data source configuration.
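The detection described above can be pictured as a simple prefix check. This is only an illustrative sketch. `detect_dataset_source` is a hypothetical function, the SDK's actual logic is not shown in this notebook, and the ARN test assumes the standard `aws` partition and the `/DataSet/` path segment seen in the example.

```python
def detect_dataset_source(dataset: str) -> str:
    """Classify a dataset reference as 's3_uri' or 'dataset_arn' (hypothetical helper)."""
    if dataset.startswith("s3://"):
        return "s3_uri"
    # Assumption: AI Registry dataset ARNs look like the example above.
    if dataset.startswith("arn:aws:sagemaker:") and "/DataSet/" in dataset:
        return "dataset_arn"
    raise ValueError(f"unrecognized dataset reference: {dataset!r}")
```

Running a check like this before creating the evaluator catches typos such as a missing `s3://` prefix without starting a job.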


Part 2: LLM-as-Judge with Custom Models#

The following sections demonstrate evaluating fine-tuned Nova models. The SDK transparently handles custom models by routing through an InspectAI-based inference path.

Configuration#

Set your AWS region, S3 bucket, and model identifiers below.

# AWS Configuration
REGION = "us-east-1"
S3_BUCKET = "s3://<your-bucket>/llmaj-custom-model-eval/"

# Model Package ARN for your fine-tuned model (Section A)
MODEL_PACKAGE_ARN = "arn:aws:sagemaker:us-east-1:<account-id>:model-package/<package-name>/<version>"

# Nova model for auto-routed Bedrock evaluation (Section B)
NOVA_MODEL = "nova-textgeneration-lite"

# Judge model (evaluator) — used in both sections
EVALUATOR_MODEL = "amazon.nova-pro-v1:0"

# Dataset — S3 URI to a JSONL file with "prompt" or "query" field per line
DATASET = "s3://<your-bucket>/datasets/eval_prompts.jsonl"

# Optional: MLflow tracking server ARN
MLFLOW_ARN = "arn:aws:sagemaker:us-east-1:<account-id>:mlflow-tracking-server/<server-name>"

Imports and Setup#

import json
from sagemaker.train.evaluate import LLMAsJudgeEvaluator, EvaluationPipelineExecution
from rich.pretty import pprint

# Configure logging to show INFO messages
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(name)s - %(message)s'
)

Note: LLM-as-Judge paired with InspectAI inference is supported for Nova models only.


Section A: Evaluate a Fine-Tuned Model via Model Package ARN#

When you pass a Model Package ARN as the model parameter, the SDK automatically:

  1. Detects it as a custom model

  2. Resolves model artifacts (model data URI and inference image) from the model package

  3. Routes through the InspectAI inference path — deploying a temporary endpoint, running inference, then cleaning up

  4. Passes the inference output to Phase 2, where the judge model scores the responses

No additional configuration needed — just use the same API as with JumpStart models.

# Create evaluator with a fine-tuned model package
evaluator_finetuned = LLMAsJudgeEvaluator(
    model=MODEL_PACKAGE_ARN,
    evaluator_model=EVALUATOR_MODEL,
    dataset=DATASET,
    builtin_metrics=["Correctness", "Helpfulness", "Faithfulness"],
    mlflow_resource_arn=MLFLOW_ARN,
    s3_output_path=S3_BUCKET,
)

pprint(evaluator_finetuned)

# Start the evaluation
execution_a = evaluator_finetuned.evaluate()

print("Evaluation started!")
print(f"  ARN: {execution_a.arn}")
print(f"  Name: {execution_a.name}")
print(f"  Status: {execution_a.status.overall_status}")

# Wait for completion (polls every 30 seconds, timeout after 2 hours)
execution_a.wait(poll=30, timeout=7200)

print(f"Final status: {execution_a.status.overall_status}")

# View evaluation results
execution_a.show_results(limit=10, offset=0, show_explanations=False)

Section B: Evaluate a Nova Model (Auto-Routed to Bedrock)#

Nova JumpStart models are automatically routed through the InspectAI+Bedrock inference path. No special configuration is needed — just pass the Nova model name as model.

The SDK automatically:

  1. Detects the model is a Nova JumpStart model

  2. Derives the correct Bedrock cross-region inference profile ID from your session region

  3. Routes through InspectAI to call Bedrock for inference

  4. Passes the responses to Phase 2, where the judge model scores them

The same API works for all model types — the routing is completely transparent.

# Create evaluator with a Nova model — auto-routes to InspectAI+Bedrock
evaluator_nova = LLMAsJudgeEvaluator(
    model=NOVA_MODEL,
    evaluator_model=EVALUATOR_MODEL,
    dataset=DATASET,
    builtin_metrics=["Correctness", "Helpfulness"],
    mlflow_resource_arn=MLFLOW_ARN,
    s3_output_path=S3_BUCKET,
)

pprint(evaluator_nova)

# Start the evaluation
execution_b = evaluator_nova.evaluate()

print("Evaluation started!")
print(f"  ARN: {execution_b.arn}")
print(f"  Name: {execution_b.name}")
print(f"  Status: {execution_b.status.overall_status}")

# Wait for completion (polls every 30 seconds, timeout after 2 hours)
execution_b.wait(poll=30, timeout=7200)

print(f"Final status: {execution_b.status.overall_status}")

# View evaluation results
execution_b.show_results(limit=10, offset=0, show_explanations=False)

Section C: Monitor and Retrieve Existing Evaluations#

You can retrieve any existing evaluation by its ARN, or list all LLM-as-Judge evaluations.

Retrieve a specific evaluation by ARN#

# Replace with your actual pipeline execution ARN
existing_arn = "arn:aws:sagemaker:us-east-1:<account-id>:pipeline/<pipeline-name>/execution/<execution-id>"

existing_execution = EvaluationPipelineExecution.get(
    arn=existing_arn,
    region=REGION
)

pprint(existing_execution.status)
existing_execution.show_results(limit=5, offset=0, show_explanations=False)

List all LLM-as-Judge evaluations#

# Get all LLM-as-Judge evaluations
all_executions = list(LLMAsJudgeEvaluator.get_all(region=REGION))

print(f"Found {len(all_executions)} LLM-as-Judge evaluation jobs")
for exec_item in all_executions:
    print(f"  - {exec_item.name}: {exec_item.status.overall_status}")

Refresh status of a running evaluation#

# Refresh and display step-level status
execution_b.refresh()
pprint(execution_b.status)

Cost Implications#

When using custom models or Nova models, the evaluation runs through the InspectAI inference path. This incurs the following costs:

| Cost Component | Description |
|---|---|
| InspectAI Orchestrator | A SageMaker Training instance (ml.m5.large, ~$0.12/hr) runs the InspectAI container that orchestrates inference. This instance runs for the duration of inference generation. |
| Inference Costs | Depends on your model type: |
| - Nova model (auto-routed) | Standard Bedrock per-token pricing for the Nova model. The SDK derives the correct Bedrock inference profile from your region. |
| - Fine-tuned model (Model Package) | A temporary SageMaker endpoint is created for inference and automatically cleaned up after completion. You are charged for the endpoint instance time. |
| LLM-as-Judge (Phase 2) | Standard Bedrock pricing for the judge model (evaluator_model) to score responses. This cost is the same as the standard LLMAJ path. |

Tips to manage costs:

  • Use a small dataset (5-20 samples) for initial testing

  • Choose cost-effective models: nova-textgeneration-lite for inference, amazon.nova-pro-v1:0 for judging

  • The InspectAI orchestrator instance costs little compared to inference and judging

  • Endpoints are automatically cleaned up — no manual intervention needed
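For a rough sense of scale, the cost components above can be combined in a back-of-the-envelope estimate. This is a sketch only. `estimate_eval_cost` is a hypothetical helper, the ~$0.12/hr orchestrator rate is the figure quoted above, and the per-token prices must be supplied by you from current Bedrock pricing. It covers token-priced inference and judging but not endpoint instance time for fine-tuned models.

```python
def estimate_eval_cost(
    inference_hours: float,
    input_tokens: int,
    output_tokens: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
    orchestrator_usd_per_hour: float = 0.12,  # approximate ml.m5.large rate quoted above
) -> dict:
    """Rough cost estimate in USD (hypothetical helper; prices are caller-supplied)."""
    # Orchestrator instance runs for the duration of inference generation.
    orchestrator = inference_hours * orchestrator_usd_per_hour
    # Token-priced Bedrock usage (inference and/or judging), billed per 1,000 tokens.
    tokens = (input_tokens / 1000) * usd_per_1k_input + (output_tokens / 1000) * usd_per_1k_output
    return {
        "orchestrator": orchestrator,
        "bedrock_tokens": tokens,
        "total": orchestrator + tokens,
    }
```

Passing the combined token counts for inference and judging gives a single figure, or the function can be called once per model when their prices differ.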


Dataset Format#

Your evaluation dataset should be a JSONL file where each line contains a "prompt" or "query" field:

{"prompt": "What is machine learning?"}
{"prompt": "Explain the difference between supervised and unsupervised learning."}
{"prompt": "How does a neural network work?"}

The SDK automatically converts this to the InspectAI format internally. No manual reformatting is needed.
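Before uploading a dataset to S3, it can help to confirm that every record has one of the expected fields. The following is a minimal sketch. `check_jsonl_prompts` is a hypothetical helper, not an SDK function, and skipping blank lines is an assumption about how lenient the parser is.

```python
import json


def check_jsonl_prompts(lines):
    """Return (line_number, problem) pairs for records lacking a 'prompt' or 'query' field."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict) or not ("prompt" in record or "query" in record):
            problems.append((number, "missing 'prompt' or 'query' field"))
    return problems
```

The function accepts any iterable of lines, so an open file handle for a local copy of the dataset can be passed directly. An empty result means every record passed the check.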