sagemaker.train.evaluate.inspect_ai_evaluator

InspectAI Evaluator for SageMaker Model Evaluation Module.

This module provides evaluation capabilities backed by InspectAI, giving access to the benchmarks and methodologies that the framework supports. The evaluator runs InspectAI tasks inside a dedicated container on SageMaker Training infrastructure.

Classes

InspectAIEvaluator(*[, region, role, ...])

InspectAI evaluation job.

class sagemaker.train.evaluate.inspect_ai_evaluator.InspectAIEvaluator(*, region: str | None = None, role: str | None = None, sagemaker_session: Any | None = None, model: str | BaseTrainer | AgentRFTJob | ModelPackage, base_model_name: str | None = None, base_eval_name: str | None = None, s3_output_path: str, mlflow_resource_arn: str | None = None, mlflow_experiment_name: str | None = None, mlflow_run_name: str | None = None, networking: VpcConfig | None = None, kms_key_id: str | None = None, model_package_group: str | ModelPackageGroup | None = None, compute: Compute | HyperPodCompute | None = None, training_image: str | None = None, recipe: str | None = None, overrides: Dict[str, Any] | None = None, benchmarks_path: str, tasks: List[Dict[str, Any]] | None = None, output_format: str | None = None, bedrock_model_id: str | None = None, endpoint_name: str | None = None, model_s3_uri: str | None = None, inference_image_uri: str | None = None, endpoint_instance_type: str | None = None, endpoint_instance_count: int = 1, endpoint_execution_role_arn: str | None = None, context_length: str | None = None, max_concurrency: str | None = None, cleanup_endpoint: bool = True, endpoint_prefix: str = 'inspectai', endpoint_environment: Dict[str, str] | None = None, extra_args: List[str] | None = None, environment: Dict[str, str] | None = None, image_uri: str | None = None, instance_type: str = 'ml.m5.large', max_runtime_seconds: int = 86400, max_connections: int = 16, max_retries: int = 100, timeout: int = 600, temperature: float = 0.0, top_p: float = 1.0, top_k: int = -1, max_tokens: int = 8192)[source]#

Bases: BaseEvaluator

InspectAI evaluation job.

Runs InspectAI tasks inside a SageMaker Training container and supports three inference provider modes: Bedrock, an existing SageMaker endpoint, or a newly created endpoint.

The evaluator serializes configuration to a YAML file (inspect_config.yaml), uploads it to S3, and launches a single-step SageMaker Pipeline that runs the InspectAI container with that config as input.

Supports resource chaining: a completed trainer (e.g., SFTTrainer, DPOTrainer, MultiTurnRLTrainer) can be passed directly as the model parameter. The evaluator will automatically resolve the trainer’s output model package artifacts and configure endpoint creation for evaluation.

benchmarks_path#

S3 URI pointing to benchmark .py files with @task decorators. Required.

Type:

str

tasks#

List of task configurations. Each dict must have a "name" key. Optional keys: "path" (must end with .py), "limit" (int >= 1), "epochs" (int >= 1), "task_args" (dict). If None or empty, all tasks at benchmarks_path are run.

Type:

Optional[List[Dict[str, Any]]]
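
The constraints on tasks entries can be checked locally before a job is submitted. The following is a minimal sketch, not the evaluator's own validation: the helper name check_tasks is hypothetical, and it only encodes the rules stated above (a required "name", a "path" ending in .py, "limit" and "epochs" as integers >= 1, and "task_args" as a dict).

```python
from typing import Any, Dict, List, Optional


def check_tasks(tasks: Optional[List[Dict[str, Any]]]) -> List[str]:
    """Return a list of problems; an empty list means no documented rule is violated.

    Hypothetical helper that mirrors the documented constraints only.
    None or an empty list is accepted (all tasks at benchmarks_path are run).
    """
    problems: List[str] = []
    for i, task in enumerate(tasks or []):
        if "name" not in task:
            problems.append(f"tasks[{i}]: missing required key 'name'")

        path = task.get("path")
        if path is not None and not str(path).endswith(".py"):
            problems.append(f"tasks[{i}]: 'path' must end with .py")

        for key in ("limit", "epochs"):
            value = task.get(key)
            if value is None:
                continue
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value < 1:
                problems.append(f"tasks[{i}]: '{key}' must be an int >= 1")

        task_args = task.get("task_args")
        if task_args is not None and not isinstance(task_args, dict):
            problems.append(f"tasks[{i}]: 'task_args' must be a dict")
    return problems
```

Running such a check before constructing the evaluator surfaces configuration mistakes without waiting for a remote job to fail.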

output_format#

Output format for results. One of "eval", "csv", "jsonl", "json".

Type:

Optional[str]

bedrock_model_id#

Explicit Bedrock model ID for bedrock inference mode. Falls back to the model’s bedrock_model_id if not set.

Type:

Optional[str]

endpoint_name#

Existing SageMaker endpoint name. Mutually exclusive with model_s3_uri/inference_image_uri.

Type:

Optional[str]

model_s3_uri#

S3 URI of model artifacts for creating a new endpoint. Must be paired with inference_image_uri.

Type:

Optional[str]

inference_image_uri#

ECR image URI for creating a new endpoint. Must be paired with model_s3_uri.

Type:

Optional[str]
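
The relationship between endpoint_name, model_s3_uri, and inference_image_uri can be summarized in code. This is a sketch under stated assumptions rather than the evaluator's actual resolution logic: the function name and the mode labels are hypothetical, and falling back to Bedrock when none of the three is set is an assumption (when a trainer is passed as model, the evaluator configures endpoint creation itself).

```python
from typing import Optional


def resolve_inference_mode(
    endpoint_name: Optional[str] = None,
    model_s3_uri: Optional[str] = None,
    inference_image_uri: Optional[str] = None,
) -> str:
    """Pick an inference mode label from the endpoint-related arguments.

    Hypothetical helper; the returned labels are illustrative only.
    """
    # endpoint_name is mutually exclusive with the new-endpoint arguments.
    if endpoint_name and (model_s3_uri or inference_image_uri):
        raise ValueError(
            "endpoint_name is mutually exclusive with "
            "model_s3_uri/inference_image_uri"
        )
    # model_s3_uri and inference_image_uri must be given as a pair.
    if bool(model_s3_uri) != bool(inference_image_uri):
        raise ValueError(
            "model_s3_uri and inference_image_uri must be provided together"
        )
    if endpoint_name:
        return "existing_endpoint"
    if model_s3_uri:
        return "new_endpoint"
    # Assumption: with no endpoint arguments, inference goes through Bedrock.
    return "bedrock"
```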

endpoint_instance_type#

Instance type for new endpoint (must start with ml.).

Type:

Optional[str]

endpoint_instance_count#

Instance count for new endpoint. Defaults to 1.

Type:

int

endpoint_execution_role_arn#

IAM role ARN for new endpoint.

Type:

Optional[str]

context_length#

Context length, given as a string containing an integer.

Type:

Optional[str]

max_concurrency#

Maximum concurrency, given as a string containing an integer.

Type:

Optional[str]
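
Because context_length and max_concurrency are typed as strings rather than integers, numeric values need converting before they are passed in. The helpers below are a small illustrative sketch; both names are hypothetical and not part of the module.

```python
from typing import Optional


def as_string_integer(value: Optional[int]) -> Optional[str]:
    """Convert an int to its string form; pass None through unchanged."""
    if value is None:
        return None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("expected an int or None")
    return str(value)


def is_string_integer(value: Optional[str]) -> bool:
    """True for None or a non-empty string made only of ASCII digits."""
    if value is None:
        return True
    return isinstance(value, str) and value.isascii() and value.isdigit()
```

For example, as_string_integer(4096) produces a value suitable for context_length, while is_string_integer("4k") flags a string that is not an integer.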

cleanup_endpoint#

Delete endpoint after evaluation. Defaults to True.

Type:

bool

endpoint_prefix#

Prefix for auto-created endpoint names.

Type:

str

endpoint_environment#

Env vars for the inference endpoint container.

Type:

Optional[Dict[str, str]]

extra_args#

Additional CLI args forwarded to inspect eval.

Type:

Optional[List[str]]

environment#

Env vars for the SageMaker Training Job container.

Type:

Optional[Dict[str, str]]

image_uri#

Override for the InspectAI container image URI.

Type:

Optional[str]

instance_type#

Instance type for the orchestrator Training Job (CPU-only). Defaults to "ml.m5.large".

Type:

str

max_runtime_seconds#

Max runtime for the Training Job in seconds. Defaults to 86400 (24 hours).

Type:

int

max_connections#

Max concurrent inference connections used by the InspectAI eval runner. Defaults to 16.

Type:

int

max_retries#

Max retries per inference request. Defaults to 100.

Type:

int

timeout#

Per-request timeout in seconds. Defaults to 600.

Type:

int

temperature#

Sampling temperature in [0.0, 2.0]. Defaults to 0.0.

Type:

float

top_p#

Nucleus sampling cutoff in [0.0, 1.0]. Defaults to 1.0.

Type:

float

top_k#

Top-k sampling cutoff. Use -1 to disable. Defaults to -1.

Type:

int

max_tokens#

Max tokens to generate per response. Defaults to 8192.

Type:

int
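
The documented ranges for the sampling parameters can be verified before launching an evaluation. The sketch below is illustrative and not the evaluator's own validation: check_sampling is a hypothetical name, and only the ranges stated above are enforced (top_k and max_tokens are left unchecked because no bounds are documented beyond -1 disabling top-k).

```python
from typing import List


def check_sampling(temperature: float = 0.0, top_p: float = 1.0) -> List[str]:
    """Return a list of range violations; empty means both values are in range.

    Hypothetical helper using the documented defaults and closed intervals.
    """
    problems: List[str] = []
    if not 0.0 <= temperature <= 2.0:
        problems.append(f"temperature={temperature} is outside [0.0, 2.0]")
    if not 0.0 <= top_p <= 1.0:
        problems.append(f"top_p={top_p} is outside [0.0, 1.0]")
    return problems
```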

Example

from sagemaker.train.evaluate import InspectAIEvaluator

evaluator = InspectAIEvaluator(
    model="amazon-nova-lite-v1",
    benchmarks_path="s3://my-bucket/benchmarks/",
    tasks=[{"name": "boolq_pt", "limit": 10}],
    s3_output_path="s3://my-bucket/eval-output/",
)
execution = evaluator.evaluate()
execution.wait()
execution.show_results()

Resource chaining with a trainer:

from sagemaker.train import SFTTrainer
from sagemaker.train.evaluate import InspectAIEvaluator

# Train a model
trainer = SFTTrainer(model="llama3-2-1b-instruct", ...)
trainer.train(training_dataset="s3://bucket/data.jsonl")

# Evaluate the fine-tuned model directly
evaluator = InspectAIEvaluator(
    model=trainer,
    benchmarks_path="s3://my-bucket/benchmarks/",
    tasks=[{"name": "boolq_pt", "limit": 10}],
    s3_output_path="s3://my-bucket/eval-output/",
)
execution = evaluator.evaluate()
base_eval_name: str | None#
base_model_name: str | None#
bedrock_model_id: str | None#
benchmarks_path: str#
cleanup_endpoint: bool#
compute: Compute | HyperPodCompute | None#
context_length: str | None#
endpoint_environment: Dict[str, str] | None#
endpoint_execution_role_arn: str | None#
endpoint_instance_count: int#
endpoint_instance_type: str | None#
endpoint_name: str | None#
endpoint_prefix: str#
environment: Dict[str, str] | None#
evaluate() EvaluationPipelineExecution[source]#

Create and start an InspectAI evaluation job.

Serializes the InspectAI configuration to YAML, uploads it to S3, and launches a single-step SageMaker Pipeline with the InspectAI container.

Returns:

The started evaluation execution, with .wait(), .refresh(), and .show_results() methods.

Return type:

EvaluationPipelineExecution

Example

evaluator = InspectAIEvaluator(
    model="amazon-nova-lite-v1",
    benchmarks_path="s3://my-bucket/benchmarks/",
    tasks=[{"name": "boolq_pt", "limit": 10}],
    s3_output_path="s3://my-bucket/eval-output/",
)
execution = evaluator.evaluate()
execution.wait()
execution.show_results()
extra_args: List[str] | None#
classmethod get_all(session: Any | None = None, region: str | None = None) Iterator[EvaluationPipelineExecution][source]#

Get all InspectAI evaluation executions.

Parameters:
  • session (Optional[Any]) – Optional boto3 session.

  • region (Optional[str]) – Optional AWS region.

Yields:

EvaluationPipelineExecution – InspectAI evaluation execution instances.

image_uri: str | None#
inference_image_uri: str | None#
instance_type: str#
kms_key_id: str | None#
max_concurrency: str | None#
max_connections: int#
max_retries: int#
max_runtime_seconds: int#
max_tokens: int#
mlflow_experiment_name: str | None#
mlflow_resource_arn: str | None#
mlflow_run_name: str | None#
model: str | BaseTrainer | AgentRFTJob | ModelPackage#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Pydantic configuration for this class. It should be a dictionary conforming to pydantic.config.ConfigDict.

model_package_group: str | ModelPackageGroup | None#
model_s3_uri: str | None#
networking: VpcConfig | None#
output_format: str | None#
overrides: Dict[str, Any] | None#
recipe: str | None#
region: str | None#
role: str | None#
s3_output_path: str#
sagemaker_session: Any | None#
tasks: List[Dict[str, Any]] | None#
temperature: float#
timeout: int#
top_k: int#
top_p: float#
training_image: str | None#
upload_benchmarks(local_path: str) str[source]#

Upload local benchmark files to S3.

Uploads all files from a local directory to an S3 prefix under the configured output path. The uploaded path can be used as benchmarks_path.

Parameters:

local_path – Local directory path containing .py files with @task decorators.

Returns:

S3 URI prefix where benchmarks were uploaded.

Raises:

ValueError – If local_path does not exist or is not a directory.
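
Before calling upload_benchmarks, it can be useful to confirm locally which files in the directory actually define tasks. The following is a minimal sketch and not the method's implementation: find_task_files is a hypothetical helper, the recursive search is an assumption, and the check is a plain substring match for "@task" rather than real parsing.

```python
from pathlib import Path
from typing import List


def find_task_files(local_path: str) -> List[Path]:
    """List .py files under local_path whose text contains '@task'.

    Hypothetical pre-check; raises ValueError for a missing or
    non-directory path, matching the documented error condition.
    """
    root = Path(local_path)
    if not root.is_dir():
        raise ValueError(f"{local_path} does not exist or is not a directory")
    return sorted(
        p
        for p in root.rglob("*.py")  # assumption: subdirectories are included
        if "@task" in p.read_text(encoding="utf-8", errors="ignore")
    )
```

An empty result suggests the directory would upload successfully but contribute no runnable tasks.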