SageMaker InspectAI Evaluation

This notebook demonstrates how to use the InspectAIEvaluator to run InspectAI evaluation tasks on SageMaker infrastructure.

InspectAI is an open-source framework for evaluating LLMs with a broad set of benchmarks and methodologies. The InspectAIEvaluator runs InspectAI tasks inside a dedicated container on SageMaker Training, supporting Bedrock inference, existing SageMaker endpoints, or auto-created endpoints.

Setup

Import necessary modules and configure logging.

# Configure AWS region and credentials via: aws configure, IAM Identity Center (aws sso login), or environment variables.
#! aws configure set region us-east-1
from sagemaker.train.evaluate import InspectAIEvaluator
from rich.pretty import pprint

import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(name)s - %(message)s'
)

Step 1: Prepare Benchmarks

InspectAI benchmarks are Python files containing @task-decorated functions. Each task defines a dataset, a solver, and a scorer.

Example benchmark file (boolq_pt.py):

from inspect_ai import task, Task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def boolq_pt():
    return Task(
        dataset=json_dataset("boolq_data.json"),
        solver=[multiple_choice()],
        scorer=choice(),
    )

Your benchmark directory should also include a pyproject.toml and any data files referenced by the task.
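Before uploading, it can help to confirm locally that the directory has the layout described above. The following is a minimal sketch using only the standard library; check_benchmark_dir is a hypothetical helper (not part of the SageMaker SDK or InspectAI), and the "@task" text search is a rough heuristic rather than a real import check.

```python
import tempfile
from pathlib import Path


def check_benchmark_dir(directory: str) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks fine."""
    root = Path(directory)
    problems = []
    if not (root / "pyproject.toml").is_file():
        problems.append("missing pyproject.toml")
    # Heuristic: look for at least one Python file that mentions the @task decorator.
    task_files = [
        p for p in root.rglob("*.py")
        if "@task" in p.read_text(encoding="utf-8", errors="ignore")
    ]
    if not task_files:
        problems.append("no .py file containing an @task function")
    return problems


# Demonstration on a throwaway directory (file contents are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "boolq_pt.py").write_text("@task\ndef boolq_pt(): ...\n")
    before = check_benchmark_dir(tmp)  # pyproject.toml not yet present
    (Path(tmp) / "pyproject.toml").write_text("[project]\nname = 'boolq'\n")
    after = check_benchmark_dir(tmp)

print(before)
print(after)
```

The helper does not verify that data files referenced by a task (such as boolq_data.json in the example) are present; that check depends on how each task loads its dataset.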

Option A: Use benchmarks already in S3

If your benchmark files are already in S3, set BENCHMARKS_PATH to the S3 prefix that contains them:

# Point to existing benchmarks in S3
BENCHMARKS_PATH = "s3://<your-bucket>/benchmarks/boolq/"
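A malformed S3 URI is easier to catch locally than after a job has been launched. Here is a small sketch that splits an S3 URI into bucket and key prefix with the standard library; split_s3_uri is a hypothetical helper and the bucket name is a made-up example.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/prefix/' into (bucket, prefix); raise on anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Example bucket name is illustrative only.
bucket, prefix = split_s3_uri("s3://my-example-bucket/benchmarks/boolq/")
print(bucket, prefix)
```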

Option B: Upload local benchmarks to S3

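Use upload_benchmarks() to upload a local directory of InspectAI task files. The call returns the S3 location of the uploaded files, which the following steps use as BENCHMARKS_PATH.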

# Upload local benchmarks to S3
evaluator = InspectAIEvaluator(
    model="nova-textgeneration-lite",
    bedrock_model_id="us.amazon.nova-lite-v1:0",
    benchmarks_path="s3://<your-bucket>/benchmarks/",  # Will be overwritten
    s3_output_path="s3://<your-bucket>/eval-output/",
)
BENCHMARKS_PATH = evaluator.upload_benchmarks("/path/to/local/benchmarks/boolq/")
print(f"Benchmarks uploaded to: {BENCHMARKS_PATH}")

Step 2: Create InspectAIEvaluator

Create an evaluator instance. The evaluator supports three inference modes:

  1. Bedrock (default) — use bedrock_model_id

  2. Existing SageMaker endpoint — use endpoint_name

  3. Create new endpoint — use model_s3_uri + inference_image_uri

# Configuration
REGION = "us-east-1"
S3_OUTPUT_PATH = "s3://<your-bucket>/inspectai-eval-output/"
# Create evaluator with Bedrock inference
evaluator = InspectAIEvaluator(
    model="nova-textgeneration-lite",
    bedrock_model_id="us.amazon.nova-lite-v1:0",
    benchmarks_path=BENCHMARKS_PATH,
    tasks=[{"name": "boolq_pt", "limit": 10}],
    s3_output_path=S3_OUTPUT_PATH,
    instance_type="ml.m5.large",
    region=REGION,
)

print("InspectAIEvaluator created successfully")
pprint(evaluator)

Task Configuration Options

The tasks parameter accepts a list of dicts with:

  • name (required): Task function name

  • path (optional): Path to the .py file within the benchmarks directory

  • limit (optional): Max samples to evaluate

  • epochs (optional): Number of evaluation epochs

  • task_args (optional): Dict of additional task arguments

If tasks is omitted, all tasks found in benchmarks_path are run.
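A typo in a task dict key (for example "epoch" instead of "epochs") may only surface once the job is running, so a local check before constructing the evaluator can save time. The sketch below uses the keys listed above. validate_tasks is a hypothetical helper, and the rule that limit and epochs must be positive integers is an assumption, not documented evaluator behavior.

```python
ALLOWED_KEYS = {"name", "path", "limit", "epochs", "task_args"}


def validate_tasks(tasks):
    """Return a list of error strings; an empty list means no problems were found."""
    errors = []
    for i, task in enumerate(tasks):
        # 'name' is the only required key.
        if not isinstance(task.get("name"), str) or not task["name"]:
            errors.append(f"tasks[{i}]: 'name' is required")
        unknown = sorted(set(task) - ALLOWED_KEYS)
        if unknown:
            errors.append(f"tasks[{i}]: unknown keys {unknown}")
        # Assumption: limit and epochs are positive integers when present.
        for key in ("limit", "epochs"):
            if key in task and not (isinstance(task[key], int) and task[key] > 0):
                errors.append(f"tasks[{i}]: '{key}' must be a positive integer")
        if "task_args" in task and not isinstance(task["task_args"], dict):
            errors.append(f"tasks[{i}]: 'task_args' must be a dict")
    return errors


good = validate_tasks([{"name": "boolq_pt", "limit": 10}])
bad = validate_tasks([{"limit": 5, "epoch": 2}])
print(good)
print(bad)
```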

Alternative: Using an Existing SageMaker Endpoint

# # Evaluate using an existing SageMaker endpoint
# evaluator = InspectAIEvaluator(
#     model="nova-textgeneration-lite",
#     endpoint_name="<your-endpoint-name>",
#     benchmarks_path=BENCHMARKS_PATH,
#     tasks=[{"name": "boolq_pt"}],
#     s3_output_path=S3_OUTPUT_PATH,
#     role="arn:aws:iam::<account-id>:role/<your-execution-role>",
#     region=REGION,
# )

Alternative: Auto-Create a New Endpoint

# # Evaluate by creating a new endpoint from model artifacts
# evaluator = InspectAIEvaluator(
#     model="my-model",
#     model_s3_uri="s3://<your-bucket>/model-artifacts/model.tar.gz",
#     inference_image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/<image-name>:<tag>",
#     endpoint_instance_type="ml.g5.xlarge",
#     benchmarks_path=BENCHMARKS_PATH,
#     tasks=[{"name": "boolq_pt"}],
#     s3_output_path=S3_OUTPUT_PATH,
#     cleanup_endpoint=True,  # Delete endpoint after evaluation
#     region=REGION,
# )
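Which of the three inference modes applies depends on which parameters are supplied. The sketch below shows one way to make that choice explicit in your own setup code. detect_inference_mode is a hypothetical helper, and the rule that exactly one mode must be selected is an assumption for illustration; the evaluator itself may resolve combinations differently.

```python
def detect_inference_mode(bedrock_model_id=None, endpoint_name=None,
                          model_s3_uri=None, inference_image_uri=None):
    """Return 'bedrock', 'existing_endpoint', or 'create_endpoint'."""
    modes = []
    if bedrock_model_id:
        modes.append("bedrock")
    if endpoint_name:
        modes.append("existing_endpoint")
    if model_s3_uri or inference_image_uri:
        # Creating an endpoint needs both the artifacts and the serving image.
        if not (model_s3_uri and inference_image_uri):
            raise ValueError("model_s3_uri and inference_image_uri must be set together")
        modes.append("create_endpoint")
    # Assumption: exactly one mode should be selected per evaluator.
    if len(modes) != 1:
        raise ValueError(f"expected exactly one inference mode, got {modes or 'none'}")
    return modes[0]


mode = detect_inference_mode(bedrock_model_id="us.amazon.nova-lite-v1:0")
print(mode)
```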

Step 3: Tune Decoding Parameters (Optional)

Review the generation parameters, and adjust them if needed, before running the evaluation. The cell below prints the current settings.

# View current decoding settings
print(f"Temperature: {evaluator.temperature}")
print(f"Top-p: {evaluator.top_p}")
print(f"Top-k: {evaluator.top_k}")
print(f"Max tokens: {evaluator.max_tokens}")
print(f"Max connections: {evaluator.max_connections}")
print(f"Timeout: {evaluator.timeout}s")
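For readers less familiar with these tunables, the following sketch illustrates what top_k and top_p commonly mean when sampling the next token. It is a generic, toy illustration with made-up probabilities, not the evaluator's, Bedrock's, or any endpoint's actual implementation; providers may apply these filters differently.

```python
def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Return the renormalized distribution left after top-k, then top-p, filtering."""
    # Rank candidate tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


# Toy next-token distribution (illustrative values only).
toy = {"yes": 0.5, "no": 0.3, "maybe": 0.15, "unsure": 0.05}
print(filter_top_k_top_p(toy, top_k=3))
print(filter_top_k_top_p(toy, top_p=0.75))
```

Lower values of either parameter narrow the set of tokens the model can pick from, which tends to make benchmark runs more repeatable.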

Step 4: Run Evaluation

Call evaluate() to start the InspectAI evaluation job. This will:

  1. Build a YAML configuration file

  2. Upload it to S3

  3. Launch a SageMaker Pipeline with the InspectAI container

# Start evaluation
execution = evaluator.evaluate()

print("\n✓ Evaluation started!")
print(f"  Execution ARN: {execution.arn}")
print(f"  Status: {execution.status.overall_status}")

Step 5: Monitor Progress

Use refresh() for manual status updates or wait() to block until completion.

# Check current status
execution.refresh()
pprint(execution.status)
# Wait for completion
execution.wait(target_status="Succeeded", poll=30, timeout=3600)

print(f"\nFinal Status: {execution.status.overall_status}")
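Conceptually, a blocking wait with poll and timeout arguments is a polling loop. The sketch below shows that pattern in isolation. It is not the library's implementation: wait_for_status and its failure_states argument are hypothetical, and the status strings are simulated stand-ins that may not match the values reported by execution.status.overall_status.

```python
import time


def wait_for_status(get_status, target="Succeeded", poll=30, timeout=3600,
                    failure_states=("Failed", "Stopped"), sleep=time.sleep):
    """Call get_status() every `poll` seconds until it returns `target`."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == target:
            return status
        if status in failure_states:
            # Stop early instead of waiting out the timeout on a terminal state.
            raise RuntimeError(f"execution ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status still {status!r} after {timeout}s")
        sleep(poll)


# Simulated status sequence standing in for a real execution.
statuses = iter(["Executing", "Executing", "Succeeded"])
final = wait_for_status(lambda: next(statuses), poll=0, sleep=lambda seconds: None)
print(final)
```

Passing the sleep function as an argument keeps the loop testable without real delays.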

Step 6: View Results

execution.show_results()

Retrieve an Existing Evaluation

You can retrieve a previously started evaluation job using its ARN.

from sagemaker.train.evaluate import EvaluationPipelineExecution

# Retrieve by ARN
existing_arn = execution.arn  # Or paste a specific ARN
existing_exec = EvaluationPipelineExecution.get(arn=existing_arn, region=REGION)

print(f"Retrieved: {existing_exec.name}")
print(f"Status: {existing_exec.status.overall_status}")

List All InspectAI Evaluations

all_executions = list(InspectAIEvaluator.get_all(region=REGION))

print(f"Found {len(all_executions)} InspectAI evaluation(s):\n")
# Show at most the first 10; "item" avoids shadowing the built-in exec().
for item in all_executions[:10]:
    print(f"  - {item.name}: {item.status.overall_status}")
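When the list is long, a per-status count can be more useful than individual lines. Here is a minimal sketch that assumes each execution object exposes status.overall_status, as in the listing above; count_by_status is a hypothetical helper, and the SimpleNamespace objects are stand-ins so the example runs without AWS access.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_status(executions):
    """Count executions per overall status."""
    return Counter(e.status.overall_status for e in executions)


# Stand-in objects mimicking the attributes used above (names are made up).
fake_executions = [
    SimpleNamespace(name="eval-1", status=SimpleNamespace(overall_status="Succeeded")),
    SimpleNamespace(name="eval-2", status=SimpleNamespace(overall_status="Failed")),
    SimpleNamespace(name="eval-3", status=SimpleNamespace(overall_status="Succeeded")),
]
summary = count_by_status(fake_executions)
for status, count in summary.most_common():
    print(f"{status}: {count}")
```

With real data, pass all_executions in place of fake_executions.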

Stop a Running Job (Optional)

# Uncomment to stop the job
# execution.stop()
# print(f"Execution stopped. Status: {execution.status.overall_status}")

Summary

Inference Modes

Mode               Parameters                          Use Case
-----------------  ----------------------------------  -------------------------------
Bedrock            bedrock_model_id                    Easiest (no endpoint management)
Existing endpoint  endpoint_name                       Re-use a running endpoint
Create endpoint    model_s3_uri + inference_image_uri  Custom models not on Bedrock

Key Parameters

  • benchmarks_path: S3 URI to your InspectAI .py task files

  • tasks: List of task configurations (name, path, limit, epochs, task_args)

  • temperature, top_p, top_k, max_tokens: Decoding tunables

  • max_connections: Concurrent inference connections (default 16)

  • instance_type: Orchestrator instance type (default ml.m5.large)