SageMaker InspectAI Evaluation#
This notebook demonstrates how to use the InspectAIEvaluator to run InspectAI evaluation tasks on SageMaker infrastructure.
InspectAI is an open-source framework for evaluating LLMs with a broad set of benchmarks and methodologies. The InspectAIEvaluator runs InspectAI tasks inside a dedicated container on SageMaker Training, supporting Bedrock inference, existing SageMaker endpoints, or auto-created endpoints.
Setup#
Import necessary modules and configure logging.
# Configure AWS region and credentials via: aws configure, IAM Identity Center (aws sso login), or environment variables.
#! aws configure set region us-east-1
from sagemaker.train.evaluate import InspectAIEvaluator
from rich.pretty import pprint
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(name)s - %(message)s'
)
Step 1: Prepare Benchmarks#
InspectAI benchmarks are Python files containing @task-decorated functions. Each task defines a dataset, a solver, and a scorer.
Example benchmark file (boolq_pt.py):
from inspect_ai import task, Task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice
@task
def boolq_pt():
    return Task(
        dataset=json_dataset("boolq_data.json"),
        solver=[multiple_choice()],
        scorer=choice(),
    )
Your benchmark directory should also include a pyproject.toml and any data files referenced by the task.
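Before uploading, it can help to confirm locally that the directory has the pieces described above. The following is a minimal sketch using only the standard library; check_benchmark_dir is a hypothetical helper (not part of the SageMaker SDK or InspectAI), and the regular expression only approximates how tasks are discovered.

```python
import re
import tempfile
from pathlib import Path

# Matches "@task" or "@task(...)" at the start of a line (an approximation).
TASK_DECORATOR = re.compile(r"^\s*@task\b", re.MULTILINE)


def check_benchmark_dir(directory):
    """Return a list of problems found in a local benchmark directory.

    Hypothetical helper: it only checks for a pyproject.toml and for at
    least one .py file that appears to contain an @task-decorated function.
    """
    root = Path(directory)
    problems = []
    if not (root / "pyproject.toml").is_file():
        problems.append("missing pyproject.toml")
    task_files = [
        p for p in sorted(root.rglob("*.py"))
        if TASK_DECORATOR.search(p.read_text(encoding="utf-8"))
    ]
    if not task_files:
        problems.append("no .py file with an @task-decorated function")
    return problems


# Demo on a throwaway directory: check it empty, then after adding files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = check_benchmark_dir(root)
    (root / "pyproject.toml").write_text(
        '[project]\nname = "boolq-benchmarks"\nversion = "0.1.0"\n',
        encoding="utf-8",
    )
    (root / "boolq_pt.py").write_text(
        "from inspect_ai import task\n\n@task\ndef boolq_pt():\n    ...\n",
        encoding="utf-8",
    )
    after = check_benchmark_dir(root)

print(before)
print(after)
```

The check does not verify that data files referenced by a task (such as boolq_data.json) exist; that would require importing or parsing the task itself.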
Option A: Use benchmarks already in S3#
If your benchmark files are already uploaded to S3, just point to them:
# Point to existing benchmarks in S3
BENCHMARKS_PATH = "s3://<your-bucket>/benchmarks/boolq/"
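Because the paths in this notebook contain placeholders such as <your-bucket>, a small check can catch an unreplaced placeholder before a job is launched. This is a sketch using only the standard library; split_s3_uri is a hypothetical helper and is not provided by the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an S3 URI into (bucket, key_prefix), rejecting obvious mistakes.

    Hypothetical helper for illustration only.
    """
    if "<" in uri or ">" in uri:
        raise ValueError(f"placeholder not replaced in {uri!r}")
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # netloc is the bucket name; the path (minus its leading slash) is the prefix.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://my-example-bucket/benchmarks/boolq/")
print(bucket, prefix)
```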
Option B: Upload local benchmarks to S3#
Use upload_benchmarks() to upload a local directory of InspectAI task files.
# Upload local benchmarks to S3
evaluator = InspectAIEvaluator(
    model="nova-textgeneration-lite",
    bedrock_model_id="us.amazon.nova-lite-v1:0",
    benchmarks_path="s3://<your-bucket>/benchmarks/",  # Will be overwritten
    s3_output_path="s3://<your-bucket>/eval-output/",
)
BENCHMARKS_PATH = evaluator.upload_benchmarks("/path/to/local/benchmarks/boolq/")
print(f"Benchmarks uploaded to: {BENCHMARKS_PATH}")
Step 2: Create InspectAIEvaluator#
Create an evaluator instance. The evaluator supports three inference modes:
Bedrock (default): use bedrock_model_id
Existing SageMaker endpoint: use endpoint_name
Create new endpoint: use model_s3_uri + inference_image_uri
# Configuration
REGION = "us-east-1"
S3_OUTPUT_PATH = "s3://<your-bucket>/inspectai-eval-output/"
# Create evaluator with Bedrock inference
evaluator = InspectAIEvaluator(
    model="nova-textgeneration-lite",
    bedrock_model_id="us.amazon.nova-lite-v1:0",
    benchmarks_path=BENCHMARKS_PATH,
    tasks=[{"name": "boolq_pt", "limit": 10}],
    s3_output_path=S3_OUTPUT_PATH,
    instance_type="ml.m5.large",
    region=REGION,
)
print("InspectAIEvaluator created successfully")
pprint(evaluator)
Task Configuration Options#
The tasks parameter accepts a list of dicts with:
name (required): Task function name
path (optional): Path to the .py file within the benchmarks directory
limit (optional): Maximum number of samples to evaluate
epochs (optional): Number of evaluation epochs
task_args (optional): Dict of additional task arguments
If tasks is omitted, all tasks found in benchmarks_path are run.
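When a run covers several tasks, building the dicts through a small function keeps the keys consistent. The sketch below is a hypothetical helper, not an SDK function; it uses only the five keys listed above and omits any optional key that is not set.

```python
def make_task(name, path=None, limit=None, epochs=None, task_args=None):
    """Build one entry for the tasks list (hypothetical helper).

    Only the keys documented above are produced; unset optional keys are
    left out rather than passed as None.
    """
    if not name:
        raise ValueError("task name is required")
    task = {"name": name}
    optional = {
        "path": path,
        "limit": limit,
        "epochs": epochs,
        "task_args": task_args,
    }
    task.update({key: value for key, value in optional.items() if value is not None})
    return task


tasks = [
    make_task("boolq_pt", limit=10),
    make_task("boolq_pt", path="boolq_pt.py", epochs=2),
]
print(tasks)
```

The resulting list has the same shape as the tasks argument used when creating the evaluator.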
Alternative: Using an Existing SageMaker Endpoint#
# # Evaluate using an existing SageMaker endpoint
# evaluator = InspectAIEvaluator(
#     model="nova-textgeneration-lite",
#     endpoint_name="<your-endpoint-name>",
#     benchmarks_path=BENCHMARKS_PATH,
#     tasks=[{"name": "boolq_pt"}],
#     s3_output_path=S3_OUTPUT_PATH,
#     role="arn:aws:iam::<account-id>:role/<your-execution-role>",
#     region=REGION,
# )
Alternative: Auto-Create a New Endpoint#
# # Evaluate by creating a new endpoint from model artifacts
# evaluator = InspectAIEvaluator(
#     model="my-model",
#     model_s3_uri="s3://<your-bucket>/model-artifacts/model.tar.gz",
#     inference_image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/<image-name>:<tag>",
#     endpoint_instance_type="ml.g5.xlarge",
#     benchmarks_path=BENCHMARKS_PATH,
#     tasks=[{"name": "boolq_pt"}],
#     s3_output_path=S3_OUTPUT_PATH,
#     cleanup_endpoint=True,  # Delete endpoint after evaluation
#     region=REGION,
# )
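The three constructor calls differ only in which inference parameters they pass. As a rough illustration, the hypothetical helper below reports which mode a set of keyword arguments corresponds to. It assumes that exactly one mode's parameters are supplied; this notebook does not state how the evaluator itself resolves missing or conflicting parameters, so treat the rules here as an assumption rather than SDK behavior.

```python
def inference_mode(**kwargs):
    """Name the inference mode implied by the given keyword arguments.

    Hypothetical helper. Assumption: exactly one mode's parameters are set,
    and model_s3_uri / inference_image_uri are always given together.
    """
    has_model = bool(kwargs.get("model_s3_uri"))
    has_image = bool(kwargs.get("inference_image_uri"))
    if has_model != has_image:
        raise ValueError("model_s3_uri and inference_image_uri must be given together")

    candidates = (
        ("bedrock", bool(kwargs.get("bedrock_model_id"))),
        ("existing_endpoint", bool(kwargs.get("endpoint_name"))),
        ("create_endpoint", has_model and has_image),
    )
    selected = [name for name, is_set in candidates if is_set]
    if len(selected) != 1:
        raise ValueError(f"expected exactly one inference mode, got {selected}")
    return selected[0]


print(inference_mode(bedrock_model_id="us.amazon.nova-lite-v1:0"))
print(inference_mode(endpoint_name="my-endpoint"))
```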
Step 3: Tune Decoding Parameters (Optional)#
Review the generation parameters before running the evaluation, and adjust them if needed.
# View current decoding settings
print(f"Temperature: {evaluator.temperature}")
print(f"Top-p: {evaluator.top_p}")
print(f"Top-k: {evaluator.top_k}")
print(f"Max tokens: {evaluator.max_tokens}")
print(f"Max connections: {evaluator.max_connections}")
print(f"Timeout: {evaluator.timeout}s")
Step 4: Run Evaluation#
Call evaluate() to start the InspectAI evaluation job. This will:
Build a YAML configuration file
Upload it to S3
Launch a SageMaker Pipeline with the InspectAI container
# Start evaluation
execution = evaluator.evaluate()
print(f"\n✓ Evaluation started!")
print(f" Execution ARN: {execution.arn}")
print(f" Status: {execution.status.overall_status}")
Step 5: Monitor Progress#
Use refresh() for manual status updates or wait() to block until completion.
# Check current status
execution.refresh()
pprint(execution.status)
# Wait for completion
execution.wait(target_status="Succeeded", poll=30, timeout=3600)
print(f"\nFinal Status: {execution.status.overall_status}")
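To make the relationship between refresh() and wait() concrete, here is a sketch of a polling loop with a timeout. It is not the SDK's implementation: FakeExecution is a stand-in class defined here so the example runs without AWS access, and the set of terminal statuses ("Failed", "Stopped") is an assumption.

```python
import time


class FakeExecution:
    """Stand-in for an execution object; NOT the SageMaker class."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)
        self.status = "Executing"

    def refresh(self):
        # Advance to the next scripted status, or keep the last one.
        self.status = next(self._statuses, self.status)


def wait_for(execution, target_status="Succeeded", poll=30, timeout=3600,
             sleep=time.sleep):
    """Refresh until target_status is reached, a terminal status is seen,
    or the timeout (in seconds) elapses."""
    deadline = time.monotonic() + timeout
    while True:
        execution.refresh()
        if execution.status == target_status:
            return execution.status
        if execution.status in ("Failed", "Stopped"):  # assumed terminal states
            raise RuntimeError(f"execution ended with status {execution.status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {execution.status} after {timeout}s")
        sleep(poll)


fake = FakeExecution(["Executing", "Executing", "Succeeded"])
final_status = wait_for(fake, poll=0, sleep=lambda seconds: None)
print(final_status)
```

Passing sleep as a parameter keeps the loop testable without real delays; in the notebook itself, execution.wait() already handles the polling.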
Step 6: View Results#
execution.show_results()
Retrieve an Existing Evaluation#
You can retrieve a previously started evaluation job using its ARN.
from sagemaker.train.evaluate import EvaluationPipelineExecution
# Retrieve by ARN
existing_arn = execution.arn # Or paste a specific ARN
existing_exec = EvaluationPipelineExecution.get(arn=existing_arn, region=REGION)
print(f"Retrieved: {existing_exec.name}")
print(f"Status: {existing_exec.status.overall_status}")
List All InspectAI Evaluations#
all_executions = list(InspectAIEvaluator.get_all(region=REGION))
print(f"Found {len(all_executions)} InspectAI evaluation(s):\n")
# Use a name other than "exec" to avoid shadowing the Python builtin
for item in all_executions[:10]:
    print(f" - {item.name}: {item.status.overall_status}")
Stop a Running Job (Optional)#
# Uncomment to stop the job
# execution.stop()
# print(f"Execution stopped. Status: {execution.status.overall_status}")
Summary#
Inference Modes#
| Mode | Parameters | Use Case |
|---|---|---|
| Bedrock | bedrock_model_id | Easiest: no endpoint management |
| Existing endpoint | endpoint_name | Re-use a running endpoint |
| Create endpoint | model_s3_uri + inference_image_uri | Custom models not on Bedrock |
Key Parameters#
benchmarks_path: S3 URI to your InspectAI .py task files
tasks: List of task configurations (name, limit, epochs)
temperature, top_p, top_k, max_tokens: Decoding tunables
max_connections: Concurrent inference connections (default 16)
instance_type: Orchestrator instance type (default ml.m5.large)