sagemaker.train.evaluate.inspect_ai_evaluator#
InspectAI Evaluator for SageMaker Model Evaluation Module.
This module provides evaluation capabilities using InspectAI as a backend, enabling a broad set of benchmarks and methodologies via the InspectAI framework. The evaluator runs InspectAI tasks inside a dedicated container on SageMaker Training infrastructure.
Classes
InspectAIEvaluator – InspectAI evaluation job.
- class sagemaker.train.evaluate.inspect_ai_evaluator.InspectAIEvaluator(*, region: str | None = None, role: str | None = None, sagemaker_session: Any | None = None, model: str | BaseTrainer | AgentRFTJob | ModelPackage, base_model_name: str | None = None, base_eval_name: str | None = None, s3_output_path: str, mlflow_resource_arn: str | None = None, mlflow_experiment_name: str | None = None, mlflow_run_name: str | None = None, networking: VpcConfig | None = None, kms_key_id: str | None = None, model_package_group: str | ModelPackageGroup | None = None, compute: Compute | HyperPodCompute | None = None, training_image: str | None = None, recipe: str | None = None, overrides: Dict[str, Any] | None = None, benchmarks_path: str, tasks: List[Dict[str, Any]] | None = None, output_format: str | None = None, bedrock_model_id: str | None = None, endpoint_name: str | None = None, model_s3_uri: str | None = None, inference_image_uri: str | None = None, endpoint_instance_type: str | None = None, endpoint_instance_count: int = 1, endpoint_execution_role_arn: str | None = None, context_length: str | None = None, max_concurrency: str | None = None, cleanup_endpoint: bool = True, endpoint_prefix: str = 'inspectai', endpoint_environment: Dict[str, str] | None = None, extra_args: List[str] | None = None, environment: Dict[str, str] | None = None, image_uri: str | None = None, instance_type: str = 'ml.m5.large', max_runtime_seconds: int = 86400, max_connections: int = 16, max_retries: int = 100, timeout: int = 600, temperature: float = 0.0, top_p: float = 1.0, top_k: int = -1, max_tokens: int = 8192)[source]#
Bases: BaseEvaluator
InspectAI evaluation job.
Runs InspectAI tasks inside a SageMaker Training container, supporting three inference provider modes: Bedrock, an existing SageMaker endpoint, or a newly created endpoint.
The evaluator serializes configuration to a YAML file (inspect_config.yaml), uploads it to S3, and launches a single-step SageMaker Pipeline that runs the InspectAI container with that config as input.
Supports resource chaining: a completed trainer (e.g., SFTTrainer, DPOTrainer, MultiTurnRLTrainer) can be passed directly as the model parameter. The evaluator will automatically resolve the trainer’s output model package artifacts and configure endpoint creation for evaluation.
- benchmarks_path#
S3 URI pointing to benchmark .py files with @task decorators. Required.
- Type:
str
- tasks#
List of task configurations. Each dict must have a "name" key. Optional keys: "path" (must end with .py), "limit" (int >= 1), "epochs" (int >= 1), "task_args" (dict). If None or empty, all tasks at benchmarks_path are run.
- Type:
Optional[List[Dict[str, Any]]]
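The rules for task dicts above can be checked locally before a job is launched. The following is a minimal sketch, not the evaluator's own validation: validate_tasks is a hypothetical helper name, and the error strings are illustrative.

```python
from typing import Any, Dict, List, Optional


def validate_tasks(tasks: Optional[List[Dict[str, Any]]]) -> List[str]:
    """Return a list of problems found in a tasks list (empty means OK).

    Hypothetical helper: it mirrors the documented rules only and is not
    the evaluator's own validation code.
    """
    problems: List[str] = []
    for i, task in enumerate(tasks or []):  # None or [] means "run all tasks"
        if not isinstance(task.get("name"), str) or not task["name"]:
            problems.append(f"tasks[{i}]: missing 'name'")
        path = task.get("path")
        if path is not None and not str(path).endswith(".py"):
            problems.append(f"tasks[{i}]: 'path' must end with .py")
        for key in ("limit", "epochs"):
            value = task.get(key)
            # bool is a subclass of int, so reject it explicitly
            if value is not None and (
                isinstance(value, bool) or not isinstance(value, int) or value < 1
            ):
                problems.append(f"tasks[{i}]: '{key}' must be an int >= 1")
        task_args = task.get("task_args")
        if task_args is not None and not isinstance(task_args, dict):
            problems.append(f"tasks[{i}]: 'task_args' must be a dict")
    return problems
```

An empty return value means the list satisfies the documented constraints; it says nothing about whether the named tasks actually exist at benchmarks_path.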
- output_format#
Output format for results. One of "eval", "csv", "jsonl", "json".
- Type:
Optional[str]
- bedrock_model_id#
Explicit Bedrock model ID for Bedrock inference mode. Falls back to the model’s bedrock_model_id if not set.
- Type:
Optional[str]
- endpoint_name#
Existing SageMaker endpoint name. Mutually exclusive with model_s3_uri/inference_image_uri.
- Type:
Optional[str]
- model_s3_uri#
S3 URI of model artifacts for creating a new endpoint. Must be paired with inference_image_uri.
- Type:
Optional[str]
- inference_image_uri#
ECR image URI for creating a new endpoint. Must be paired with model_s3_uri.
- Type:
Optional[str]
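The pairing and exclusivity rules for endpoint_name, model_s3_uri, and inference_image_uri can be sketched as a small classifier. This is an illustration under assumptions: resolve_inference_mode and the mode labels it returns are hypothetical, the fallback to Bedrock when no endpoint argument is given is assumed, and the sketch ignores the model parameter (including trainer-based resource chaining).

```python
from typing import Optional


def resolve_inference_mode(
    endpoint_name: Optional[str] = None,
    model_s3_uri: Optional[str] = None,
    inference_image_uri: Optional[str] = None,
) -> str:
    """Hypothetical helper: classify arguments into one of three modes."""
    wants_new_endpoint = model_s3_uri is not None or inference_image_uri is not None
    if endpoint_name is not None and wants_new_endpoint:
        raise ValueError(
            "endpoint_name is mutually exclusive with "
            "model_s3_uri/inference_image_uri"
        )
    if wants_new_endpoint:
        # Both halves of the pair are required to create an endpoint.
        if model_s3_uri is None or inference_image_uri is None:
            raise ValueError(
                "model_s3_uri and inference_image_uri must be set together"
            )
        return "new_endpoint"
    if endpoint_name is not None:
        return "existing_endpoint"
    # Assumption: with no endpoint arguments, inference goes through Bedrock.
    return "bedrock"
```

For example, resolve_inference_mode(endpoint_name="my-endpoint") returns "existing_endpoint", while passing only model_s3_uri raises ValueError because its paired argument is missing.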
- endpoint_instance_type#
Instance type for new endpoint (must start with ml.).
- Type:
Optional[str]
- endpoint_instance_count#
Instance count for new endpoint. Defaults to 1.
- Type:
int
- endpoint_execution_role_arn#
IAM role ARN for new endpoint.
- Type:
Optional[str]
- context_length#
Context length, given as an integer encoded as a string.
- Type:
Optional[str]
- max_concurrency#
Max concurrency, given as an integer encoded as a string.
- Type:
Optional[str]
- cleanup_endpoint#
Delete endpoint after evaluation. Defaults to True.
- Type:
bool
- endpoint_prefix#
Prefix for auto-created endpoint names.
- Type:
str
- endpoint_environment#
Env vars for the inference endpoint container.
- Type:
Optional[Dict[str, str]]
- extra_args#
Additional CLI args forwarded to inspect eval.
- Type:
Optional[List[str]]
- environment#
Env vars for the SageMaker Training Job container.
- Type:
Optional[Dict[str, str]]
- image_uri#
Override for the InspectAI container image URI.
- Type:
Optional[str]
- instance_type#
Instance type for the orchestrator Training Job (CPU-only). Defaults to "ml.m5.large".
- Type:
str
- max_runtime_seconds#
Max runtime for the Training Job in seconds. Defaults to 86400 (24 hours).
- Type:
int
- max_connections#
Max concurrent inference connections used by the InspectAI eval runner. Defaults to 16.
- Type:
int
- max_retries#
Max retries per inference request. Defaults to 100.
- Type:
int
- timeout#
Per-request timeout in seconds. Defaults to 600.
- Type:
int
- temperature#
Sampling temperature in [0.0, 2.0]. Defaults to 0.0.
- Type:
float
- top_p#
Nucleus sampling cutoff in [0.0, 1.0]. Defaults to 1.0.
- Type:
float
- top_k#
Top-k sampling cutoff. Use -1 to disable. Defaults to -1.
- Type:
int
- max_tokens#
Max tokens to generate per response. Defaults to 8192.
- Type:
int
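The documented ranges for the sampling parameters can be verified before constructing the evaluator. The sketch below is illustrative only: check_sampling is a hypothetical helper, and the requirement that top_k be at least 1 when it is not -1 is an assumption, since the reference states only that -1 disables the cutoff.

```python
def check_sampling(
    temperature: float = 0.0, top_p: float = 1.0, top_k: int = -1
) -> None:
    """Hypothetical helper: raise ValueError for out-of-range values."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature must be in [0.0, 2.0], got {temperature}")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError(f"top_p must be in [0.0, 1.0], got {top_p}")
    # Assumption: apart from the -1 sentinel, top_k should be a positive count.
    if top_k != -1 and top_k < 1:
        raise ValueError(f"top_k must be -1 (disabled) or >= 1, got {top_k}")


check_sampling()  # the documented defaults pass without raising
```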
Example
    from sagemaker.train.evaluate import InspectAIEvaluator

    evaluator = InspectAIEvaluator(
        model="amazon-nova-lite-v1",
        benchmarks_path="s3://my-bucket/benchmarks/",
        tasks=[{"name": "boolq_pt", "limit": 10}],
        s3_output_path="s3://my-bucket/eval-output/",
    )
    execution = evaluator.evaluate()
    execution.wait()
    execution.show_results()
Resource chaining with a trainer:
    from sagemaker.train import SFTTrainer
    from sagemaker.train.evaluate import InspectAIEvaluator

    # Train a model
    trainer = SFTTrainer(model="llama3-2-1b-instruct", ...)
    trainer.train(training_dataset="s3://bucket/data.jsonl")

    # Evaluate the fine-tuned model directly
    evaluator = InspectAIEvaluator(
        model=trainer,
        benchmarks_path="s3://my-bucket/benchmarks/",
        tasks=[{"name": "boolq_pt", "limit": 10}],
        s3_output_path="s3://my-bucket/eval-output/",
    )
    execution = evaluator.evaluate()
- base_eval_name: str | None#
- base_model_name: str | None#
- bedrock_model_id: str | None#
- benchmarks_path: str#
- cleanup_endpoint: bool#
- compute: Compute | HyperPodCompute | None#
- context_length: str | None#
- endpoint_environment: Dict[str, str] | None#
- endpoint_execution_role_arn: str | None#
- endpoint_instance_count: int#
- endpoint_instance_type: str | None#
- endpoint_name: str | None#
- endpoint_prefix: str#
- environment: Dict[str, str] | None#
- evaluate() → EvaluationPipelineExecution[source]#
Create and start an InspectAI evaluation job.
Serializes the InspectAI configuration to YAML, uploads it to S3, and launches a single-step SageMaker Pipeline with the InspectAI container.
- Returns:
The started evaluation execution with .wait(), .refresh(), and .show_results() methods.
- Return type:
EvaluationPipelineExecution
Example
    evaluator = InspectAIEvaluator(
        model="amazon-nova-lite-v1",
        benchmarks_path="s3://my-bucket/benchmarks/",
        tasks=[{"name": "boolq_pt", "limit": 10}],
        s3_output_path="s3://my-bucket/eval-output/",
    )
    execution = evaluator.evaluate()
    execution.wait()
    execution.show_results()
- extra_args: List[str] | None#
- classmethod get_all(session: Any | None = None, region: str | None = None) → Iterator[EvaluationPipelineExecution][source]#
Get all InspectAI evaluation executions.
- Parameters:
session (Optional[Any]) – Optional boto3 session.
region (Optional[str]) – Optional AWS region.
- Yields:
EvaluationPipelineExecution – InspectAI evaluation execution instances.
- image_uri: str | None#
- inference_image_uri: str | None#
- instance_type: str#
- kms_key_id: str | None#
- max_concurrency: str | None#
- max_connections: int#
- max_retries: int#
- max_runtime_seconds: int#
- max_tokens: int#
- mlflow_experiment_name: str | None#
- mlflow_resource_arn: str | None#
- mlflow_run_name: str | None#
- model: str | BaseTrainer | AgentRFTJob | ModelPackage#
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#
Pydantic configuration for this class (not the model under evaluation); should be a dictionary conforming to pydantic.config.ConfigDict.
- model_package_group: str | ModelPackageGroup | None#
- model_s3_uri: str | None#
- output_format: str | None#
- overrides: Dict[str, Any] | None#
- recipe: str | None#
- region: str | None#
- role: str | None#
- s3_output_path: str#
- sagemaker_session: Any | None#
- tasks: List[Dict[str, Any]] | None#
- temperature: float#
- timeout: int#
- top_k: int#
- top_p: float#
- training_image: str | None#
- upload_benchmarks(local_path: str) → str[source]#
Upload local benchmark files to S3.
Uploads all files from a local directory to an S3 prefix under the configured output path. The uploaded path can be used as benchmarks_path.
- Parameters:
local_path – Local directory path containing .py files with @task decorators.
- Returns:
S3 URI prefix where benchmarks were uploaded.
- Raises:
ValueError – If local_path does not exist or is not a directory.
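Since upload_benchmarks uploads every file in the directory, it can help to confirm locally that the directory exists and contains benchmark files first. The following is a minimal pre-flight sketch, assuming a purely textual search for @task is sufficient; list_benchmark_files is a hypothetical helper and is not part of the evaluator.

```python
from pathlib import Path
from typing import List


def list_benchmark_files(local_path: str) -> List[str]:
    """Hypothetical pre-flight check to run before upload_benchmarks."""
    root = Path(local_path)
    if not root.is_dir():
        # Same condition the reference gives for ValueError.
        raise ValueError(f"{local_path} does not exist or is not a directory")
    # Textual check only: the files are read, never imported or executed.
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*.py")
        if "@task" in p.read_text(encoding="utf-8", errors="ignore")
    )
```

If the returned list is non-empty, the directory can then be passed to evaluator.upload_benchmarks(local_path), whose return value is the S3 URI prefix usable as benchmarks_path.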