sagemaker.train.evaluate.multi_turn_rl_evaluator#

MultiTurnRLEvaluator — evaluate MTRL agents on held-out prompts.

This module implements MultiTurnRLEvaluator, the SDK surface for evaluating Multi-Turn Reinforcement Learning (MTRL) agent models via the AgentRFT CreateJob pipeline step. It mirrors the architecture of sagemaker.train.evaluate.BenchMarkEvaluator and adds MTRL-specific fields, validators, and the three-template rendering surface defined in sagemaker.train.evaluate.mtrl_pipeline_templates.

Classes

MultiTurnRLEvaluator(*[, region, role, ...])

Evaluate a multi-turn RL agent model against a held-out prompt dataset.

class sagemaker.train.evaluate.multi_turn_rl_evaluator.MultiTurnRLEvaluator(*, region: str | None = None, role: str | None = None, sagemaker_session: Any | None = None, model: str | BaseTrainer | AgentRFTJob | ModelPackage, base_eval_name: str | None = None, s3_output_path: str, mlflow_resource_arn: str | None = None, mlflow_experiment_name: str | None = None, mlflow_run_name: str | None = None, networking: VpcConfig | None = None, kms_key_id: str | None = None, model_package_group: str | ModelPackageGroup | None = None, dataset: Any, agent_config: Any | None = None, agent_qualifier: str | None = None, accept_eula: bool = True, evaluate_base_model: bool = False, stopping_condition: int = 86400, tags: List[Dict[str, str]] | None = None)[source]#

Bases: BaseEvaluator

Evaluate a multi-turn RL agent model against a held-out prompt dataset.

The evaluator runs rollouts of the agent against an environment (Bedrock AgentCore runtime or a Lambda-wrapped agent) and computes aggregate metrics (pass@k, mean reward, etc.). Execution routes through SageMaker Pipelines using the new AgentRFT Job step type (JobCategory="AgentRFTEvaluation").
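The paragraph above names pass@k and mean reward as aggregate metrics but does not say how they are computed. For orientation only, the sketch below uses the commonly cited unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n rollouts of which c succeeded. Whether the evaluator uses this exact estimator is an assumption, and pass_at_k, summarize, and the threshold argument are hypothetical names, not part of the SDK.

```python
from __future__ import annotations

from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for n rollouts with c successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random
    # size-k subset of the rollouts contains no success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(rewards: list[float], threshold: float, k: int) -> dict:
    """Aggregate one prompt's rollout rewards (hypothetical helper)."""
    # Assumption: a rollout counts as a success when reward >= threshold.
    successes = sum(r >= threshold for r in rewards)
    return {
        "mean_reward": mean(rewards),
        "pass_at_k": pass_at_k(len(rewards), successes, k),
    }
```

Under these assumptions, summarize would be applied per prompt and the results averaged across the dataset.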

The evaluator supports three evaluation shapes, selected automatically based on the provided inputs:

  • Base-model only — pass a base model (JumpStart ID or ModelPackage) with an explicit agent_config.

  • Fine-tuned only — pass a MultiTurnRLTrainer or a fine-tuned ModelPackage; the evaluator extracts the source model package ARN and evaluates only that model.

  • Base + fine-tuned comparison — pass evaluate_base_model=True along with a fine-tuned trainer / ModelPackage; both runs land in the same MLflow experiment for side-by-side comparison.
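The selection among the three shapes can be pictured as a small decision function. This is a minimal sketch of the rule as the list above describes it, not the SDK's implementation; EvalShape, select_shape, and the is_fine_tuned flag are hypothetical names introduced for illustration.

```python
from enum import Enum


class EvalShape(Enum):
    BASE_ONLY = "base_only"
    FINE_TUNED_ONLY = "fine_tuned_only"
    COMPARISON = "comparison"


def select_shape(is_fine_tuned: bool, evaluate_base_model: bool) -> EvalShape:
    """Pick an evaluation shape from the two inputs the text describes."""
    if not is_fine_tuned:
        # Assumption: with only a base model there is nothing to compare,
        # so evaluate_base_model has no effect.
        return EvalShape.BASE_ONLY
    if evaluate_base_model:
        # Fine-tuned model present and comparison requested.
        return EvalShape.COMPARISON
    return EvalShape.FINE_TUNED_ONLY
```

The point of the sketch is that evaluate_base_model only matters once a fine-tuned model is present, which matches the description of that field.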

dataset#

Prompt dataset — S3 URI, hub-content DataSet ARN, or object exposing an .arn attribute. Required.

Type:

Union[str, Any]
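Because dataset accepts three kinds of reference, a caller may want to normalize the value before logging or validating it. The helper below is a hypothetical sketch (describe_dataset is not an SDK function) that assumes S3 URIs start with "s3://" and ARNs start with "arn:".

```python
from __future__ import annotations

from typing import Any


def describe_dataset(dataset: Any) -> tuple[str, str]:
    """Return (kind, value) for a dataset reference; illustrative only."""
    if isinstance(dataset, str):
        if dataset.startswith("s3://"):
            return ("s3_uri", dataset)
        if dataset.startswith("arn:"):
            return ("arn", dataset)
        raise ValueError(f"unrecognized dataset string: {dataset!r}")
    # Fall back to any object that exposes a string .arn attribute.
    arn = getattr(dataset, "arn", None)
    if isinstance(arn, str):
        return ("arn", arn)
    raise TypeError("dataset must be an S3 URI, an ARN, or expose .arn")
```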

agent_config#

Agent environment — Bedrock AgentCore ARN or Lambda ARN. Auto-resolved from the trainer when a MultiTurnRLTrainer is passed as model.

Type:

Optional[Union[str, Any]]

agent_qualifier#

Bedrock AgentCore qualifier (e.g. "PROD"). Ignored when agent_config is a Lambda.

Type:

Optional[str]
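Since agent_qualifier applies only to Bedrock AgentCore agents, it can help to tell the two ARN kinds apart before constructing the evaluator. The sketch below inspects the service field of the ARN. The helper names are hypothetical, and the "bedrock-agentcore" service name is an assumption that should be checked against the ARNs in your account.

```python
from __future__ import annotations


def agent_kind(agent_arn: str) -> str:
    """Classify an agent ARN by its service field (illustrative only)."""
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = agent_arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {agent_arn!r}")
    service = parts[2]
    if service == "lambda":
        return "lambda"
    if service == "bedrock-agentcore":  # assumed service name
        return "agentcore"
    raise ValueError(f"unsupported agent service: {service!r}")


def effective_qualifier(agent_arn: str, agent_qualifier: str | None) -> str | None:
    """Drop the qualifier for Lambda agents, as the field description states."""
    return agent_qualifier if agent_kind(agent_arn) == "agentcore" else None
```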

accept_eula#

Forwarded to JobConfigDocument.EvaluationConfig.AcceptEula. Defaults to True. The current templates emit true unconditionally; the flag is kept for future backend schemas.

Type:

bool

evaluate_base_model#

When True and a fine-tuned model is present, render the comparison template (both base and fine-tuned are evaluated). Defaults to False, in which case only the fine-tuned model is evaluated.

Type:

bool

stopping_condition#

Maximum job duration in seconds. Default 86400 (24 hours); must be in (0, 259200], that is, greater than 0 and at most 259200 (72 hours).

Type:

int
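The documented range (0, 259200] is half-open: zero is rejected and 259200 is accepted. A minimal sketch of an equivalent client-side check follows; validate_stopping_condition is a hypothetical helper, and the evaluator's own validator may raise a different error type or message.

```python
MAX_STOPPING_CONDITION_SECONDS = 259_200  # 72 hours, the documented upper bound


def validate_stopping_condition(seconds: int) -> int:
    """Return seconds unchanged if it lies in (0, 259200], else raise."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("stopping_condition must be an int")
    if not 0 < seconds <= MAX_STOPPING_CONDITION_SECONDS:
        raise ValueError(
            f"stopping_condition must be in (0, {MAX_STOPPING_CONDITION_SECONDS}], "
            f"got {seconds}"
        )
    return seconds
```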

tags#

Customer tags propagated to the Tags list of both the pipeline and the step.

Type:

Optional[List[Dict[str, str]]]

See BaseEvaluator for inherited fields (model, s3_output_path, mlflow_resource_arn, mlflow_experiment_name, networking, kms_key_id, model_package_group, base_eval_name, region, role, sagemaker_session).

Example

from sagemaker.train.evaluate import MultiTurnRLEvaluator

# Evaluate a fine-tuned MTRL trainer output
evaluator = MultiTurnRLEvaluator(
    model=completed_mtrl_trainer,
    dataset='s3://my-bucket/eval-prompts.jsonl',
    s3_output_path='s3://my-bucket/mtrl-eval-output/',
)

execution = evaluator.evaluate()
execution.wait()
execution.show_results()
accept_eula: bool#
agent_config: Any | None#
agent_qualifier: str | None#
base_eval_name: str | None#
dataset: Any#
evaluate() → MTRLEvaluationExecution[source]#

Render the MTRL pipeline and start a non-blocking execution.

Returns:

The started pipeline execution. Call .wait() to block until completion and .show_results() to render the aggregate report.

Return type:

MTRLEvaluationExecution

Example

execution = evaluator.evaluate()
execution.wait()
execution.show_results()
evaluate_base_model: bool#
classmethod get_all(session=None, region=None)[source]#

List all MTRL evaluation executions in the account / region.

Parameters:
  • session – Optional boto3 session.

  • region – Optional AWS region.

Yields:

EvaluationPipelineExecution – MTRL evaluation execution instances.

property hyperparameters#

Lazy-load evaluation hyperparameters from the JumpStart hub.

Returns a FineTuningOptions object exposing to_dict(), get_info(), and attribute-style read/write access with hub-sourced validation (type + range).

Supported parameters (sourced from the AgentRFT evaluation recipe): eval_group_size, sampling_temperature, top_p, max_tokens, pass_k_values, success_threshold.

Raises:

ValueError – If the base model name is not available or the hub does not expose an AgentRFTEvaluation override spec for the model.

kms_key_id: str | None#
static list_bedrock_agentcore_runtimes(session=None) → list[source]#

List Bedrock AgentCore runtimes.

Parameters:

session – Optional boto3 session.

Returns:

List of dicts, each with keys name, runtime_id, arn, and status.
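A typical use of the returned list is to pick a runtime ARN to pass as agent_config. The sketch below filters the documented dict shape (name, runtime_id, arn, status) by status. The helper name is hypothetical, and the "READY" status value is an assumption about what the service reports.

```python
from __future__ import annotations


def ready_runtime_arns(runtimes: list[dict], status: str = "READY") -> list[str]:
    """Return the ARNs of runtimes whose status matches (illustrative only)."""
    # Each entry is expected to carry the keys name, runtime_id, arn, status.
    return [r["arn"] for r in runtimes if r.get("status") == status]
```

Under these assumptions, ready_runtime_arns(MultiTurnRLEvaluator.list_bedrock_agentcore_runtimes()) would yield candidate values for agent_config.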

static list_supported_models(session=None) → list[source]#

Return the list of models that support MTRL evaluation.

Queries SageMakerPublicHub to discover all models with MTRL recipes in their RecipeCollection.

Parameters:

session – Optional boto3 session.

Returns:

List of hub content model names supporting MTRL evaluation.

mlflow_experiment_name: str | None#
mlflow_resource_arn: str | None#
mlflow_run_name: str | None#
model: str | BaseTrainer | AgentRFTJob | ModelPackage#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_package_group: str | ModelPackageGroup | None#
model_post_init(context: Any, /) → None#

This function is meant to behave like a BaseModel method to initialize private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • context – The context.

networking: VpcConfig | None#
region: str | None#
role: str | None#
s3_output_path: str#
sagemaker_session: Any | None#
stopping_condition: int#
tags: List[Dict[str, str]] | None#