sagemaker.train.evaluate.multi_turn_rl_evaluator#
MultiTurnRLEvaluator — evaluate MTRL agents on held-out prompts.
This module implements MultiTurnRLEvaluator, the SDK surface for
evaluating Multi-Turn Reinforcement Learning (MTRL) agent models via the
AgentRFT CreateJob pipeline step. It mirrors the architecture of
sagemaker.train.evaluate.BenchMarkEvaluator, with MTRL-specific
fields, validators, and the three-template rendering surface defined in
sagemaker.train.evaluate.mtrl_pipeline_templates.
Classes
MultiTurnRLEvaluator | Evaluate a multi-turn RL agent model against a held-out prompt dataset.
- class sagemaker.train.evaluate.multi_turn_rl_evaluator.MultiTurnRLEvaluator(*, region: str | None = None, role: str | None = None, sagemaker_session: Any | None = None, model: str | BaseTrainer | AgentRFTJob | ModelPackage, base_eval_name: str | None = None, s3_output_path: str, mlflow_resource_arn: str | None = None, mlflow_experiment_name: str | None = None, mlflow_run_name: str | None = None, networking: VpcConfig | None = None, kms_key_id: str | None = None, model_package_group: str | ModelPackageGroup | None = None, dataset: Any, agent_config: Any | None = None, agent_qualifier: str | None = None, accept_eula: bool = True, evaluate_base_model: bool = False, stopping_condition: int = 86400, tags: List[Dict[str, str]] | None = None)[source]#
Bases: BaseEvaluator
Evaluate a multi-turn RL agent model against a held-out prompt dataset.
The evaluator runs rollouts of the agent against an environment (Bedrock AgentCore runtime or a Lambda-wrapped agent) and computes aggregate metrics (pass@k, mean reward, etc.). Execution routes through SageMaker Pipelines using the new AgentRFT Job step type (JobCategory="AgentRFTEvaluation").
The evaluator supports three evaluation shapes, selected automatically based on the provided inputs:
Base-model only: pass a base model (JumpStart ID or ModelPackage) with an explicit agent_config.
Fine-tuned only: pass a MultiTurnRLTrainer or a fine-tuned ModelPackage; the evaluator extracts the source model package ARN and evaluates only that model.
Base + fine-tuned comparison: pass evaluate_base_model=True along with a fine-tuned trainer or ModelPackage; both runs land in the same MLflow experiment for side-by-side comparison.
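The selection rules for the three shapes can be sketched in plain Python. This is only an illustration of the documented behavior: `EvalShape` and `select_eval_shape` are hypothetical names, not part of the SDK, and the real resolution logic may inspect more than these two inputs.

```python
from enum import Enum


class EvalShape(Enum):
    """Hypothetical labels for the three documented evaluation shapes."""

    BASE_ONLY = "base_only"
    FINE_TUNED_ONLY = "fine_tuned_only"
    COMPARISON = "comparison"


def select_eval_shape(has_fine_tuned_model: bool, evaluate_base_model: bool = False) -> EvalShape:
    """Pick an evaluation shape from the two documented inputs.

    Assumption: this mirrors the documented rules only; the SDK may
    check additional conditions before rendering a template.
    """
    if not has_fine_tuned_model:
        # A base model has nothing to be compared against, so the flag is moot.
        return EvalShape.BASE_ONLY
    if evaluate_base_model:
        # Fine-tuned model plus the flag: evaluate both for comparison.
        return EvalShape.COMPARISON
    # Default for fine-tuned inputs: evaluate the fine-tuned model only.
    return EvalShape.FINE_TUNED_ONLY
```

Note that `evaluate_base_model=True` only changes the outcome when a fine-tuned model is present, which matches the field description for `evaluate_base_model`.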
- dataset#
Prompt dataset: S3 URI, hub-content DataSet ARN, or object exposing an .arn attribute. Required.
- Type:
Union[str, Any]
- agent_config#
Agent environment: Bedrock AgentCore ARN or Lambda ARN. Auto-resolved from a MultiTurnRLTrainer when provided as model.
- Type:
Optional[Union[str, Any]]
- agent_qualifier#
Bedrock AgentCore qualifier (e.g. "PROD"). Ignored when agent_config is a Lambda.
- Type:
Optional[str]
- accept_eula#
Forwarded to JobConfigDocument.EvaluationConfig.AcceptEula. Defaults to True (templates emit true unconditionally; the flag is kept for future backend schemas).
- Type:
bool
- evaluate_base_model#
When True and a fine-tuned model is present, render the comparison template (both base and fine-tuned are evaluated). Defaults to False (fine-tuned only).
- Type:
bool
- stopping_condition#
Maximum job duration in seconds. Default 86400 (24 hours); must be in (0, 259200].
- Type:
int
- tags#
Customer tags propagated to the pipeline and step Tags list.
- Type:
Optional[List[Dict[str, str]]]
See BaseEvaluator for inherited fields (model, s3_output_path, mlflow_resource_arn, mlflow_experiment_name, networking, kms_key_id, model_package_group, base_eval_name, region, role, sagemaker_session).
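The field constraints above can be checked locally before constructing the evaluator. The sketch below is an assumption-laden illustration, not the SDK's validators: `resolve_dataset_ref`, `check_stopping_condition`, and `tags_from_mapping` are hypothetical helpers, and the `{"Key": ..., "Value": ...}` tag shape is assumed from common AWS usage (the field docs only state `List[Dict[str, str]]`).

```python
from typing import Any, Dict, List, Optional

# Upper bound stated in the stopping_condition field docs (72 hours).
MAX_STOPPING_CONDITION_SECONDS = 259200


def resolve_dataset_ref(dataset: Any) -> str:
    """Return a string reference for a dataset given as a str or an object with .arn."""
    if isinstance(dataset, str):
        return dataset  # S3 URI or hub-content DataSet ARN
    arn = getattr(dataset, "arn", None)
    if isinstance(arn, str) and arn:
        return arn
    raise ValueError("dataset must be a string or expose a non-empty .arn attribute")


def check_stopping_condition(seconds: int) -> int:
    """Enforce the documented half-open range (0, 259200]."""
    if not 0 < seconds <= MAX_STOPPING_CONDITION_SECONDS:
        raise ValueError(
            f"stopping_condition must be in (0, {MAX_STOPPING_CONDITION_SECONDS}], got {seconds}"
        )
    return seconds


def tags_from_mapping(mapping: Optional[Dict[str, str]]) -> Optional[List[Dict[str, str]]]:
    """Build a tag list from a plain dict.

    Assumption: tags use the {"Key": ..., "Value": ...} shape.
    """
    if mapping is None:
        return None
    return [{"Key": key, "Value": value} for key, value in mapping.items()]
```

Running these checks up front surfaces an out-of-range duration or a malformed dataset reference before any pipeline is rendered.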
Example
    from sagemaker.train.evaluate import MultiTurnRLEvaluator

    # Evaluate a fine-tuned MTRL trainer output
    evaluator = MultiTurnRLEvaluator(
        model=completed_mtrl_trainer,
        dataset='s3://my-bucket/eval-prompts.jsonl',
        s3_output_path='s3://my-bucket/mtrl-eval-output/',
    )
    execution = evaluator.evaluate()
    execution.wait()
    execution.show_results()
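For the base + fine-tuned comparison shape, the constructor arguments differ from the example above only by `evaluate_base_model=True` and, optionally, an MLflow experiment name. A minimal sketch follows; `comparison_eval_kwargs` is a hypothetical helper that only assembles keyword arguments taken from the documented signature, so it runs without the SDK installed.

```python
from typing import Any, Dict, Optional


def comparison_eval_kwargs(
    fine_tuned_model: Any,
    dataset: str,
    s3_output_path: str,
    mlflow_experiment_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect constructor keyword arguments for a base + fine-tuned comparison."""
    kwargs: Dict[str, Any] = {
        "model": fine_tuned_model,
        "dataset": dataset,
        "s3_output_path": s3_output_path,
        # Requests the comparison template when the model is fine-tuned.
        "evaluate_base_model": True,
    }
    if mlflow_experiment_name is not None:
        kwargs["mlflow_experiment_name"] = mlflow_experiment_name
    return kwargs


# Usage with the SDK installed (not executed here):
#   from sagemaker.train.evaluate import MultiTurnRLEvaluator
#   evaluator = MultiTurnRLEvaluator(**comparison_eval_kwargs(
#       completed_mtrl_trainer,
#       "s3://my-bucket/eval-prompts.jsonl",
#       "s3://my-bucket/mtrl-eval-output/",
#       mlflow_experiment_name="mtrl-comparison",
#   ))
#   execution = evaluator.evaluate()
```

The experiment name `"mtrl-comparison"` is a placeholder; any name accepted by the configured MLflow resource should work.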
- accept_eula: bool#
- agent_config: Any | None#
- agent_qualifier: str | None#
- base_eval_name: str | None#
- dataset: Any#
- evaluate() → MTRLEvaluationExecution[source]#
Render the MTRL pipeline and start a non-blocking execution.
- Returns:
The started pipeline execution. Call .wait() to block until completion and .show_results() to render the aggregate report.
- Return type:
MTRLEvaluationExecution
Example
    execution = evaluator.evaluate()
    execution.wait()
    execution.show_results()
- evaluate_base_model: bool#
- classmethod get_all(session=None, region=None)[source]#
List all MTRL evaluation executions in the account / region.
- Parameters:
session – Optional boto3 session.
region – Optional AWS region.
- Yields:
EvaluationPipelineExecution – MTRL evaluation execution instances.
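Because `get_all` yields executions rather than returning a list, a caller that only needs a few of them can stop early. The sketch below assumes lazy iteration is acceptable for the use case; `first_executions` is a hypothetical helper, and it takes the evaluator class as an argument so it can be exercised without the SDK.

```python
from itertools import islice
from typing import Any, List, Optional


def first_executions(
    evaluator_cls: Any,
    limit: int = 5,
    session: Any = None,
    region: Optional[str] = None,
) -> List[Any]:
    """Return at most `limit` executions from evaluator_cls.get_all().

    islice stops consuming the generator after `limit` items, so no
    more executions are pulled from the iterator than requested.
    """
    return list(islice(evaluator_cls.get_all(session=session, region=region), limit))


# Usage with the SDK installed (not executed here):
#   from sagemaker.train.evaluate import MultiTurnRLEvaluator
#   recent = first_executions(MultiTurnRLEvaluator, limit=3, region="us-west-2")
```

The region `"us-west-2"` is a placeholder. The order in which executions are yielded is not stated in this reference, so "first" here means first yielded, not necessarily most recent.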
- property hyperparameters#
Lazy-load evaluation hyperparameters from the JumpStart hub.
Returns a FineTuningOptions object exposing to_dict(), get_info(), and attribute-style read/write access with hub-sourced validation (type + range).
Supported parameters (sourced from the AgentRFT evaluation recipe): eval_group_size, sampling_temperature, top_p, max_tokens, pass_k_values, success_threshold.
- Raises:
ValueError – If the base model name is not available or the hub does not expose an AgentRFTEvaluation override spec for the model.
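Attribute-style writes on the returned object can be wrapped so that unknown names are rejected before anything is set. This is a sketch under assumptions: `apply_overrides` is a hypothetical helper, the supported-name set is copied from the list above, and it is assumed that `to_dict()` reflects values written through attributes.

```python
from typing import Any, Dict

# Parameter names listed in the property documentation.
SUPPORTED_EVAL_HYPERPARAMETERS = frozenset({
    "eval_group_size",
    "sampling_temperature",
    "top_p",
    "max_tokens",
    "pass_k_values",
    "success_threshold",
})


def apply_overrides(hyperparameters: Any, overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Set each override via attribute access and return the resulting dict.

    Unknown names are rejected up front so no partial update happens.
    Type and range validation is left to the hyperparameters object.
    """
    unknown = sorted(set(overrides) - SUPPORTED_EVAL_HYPERPARAMETERS)
    if unknown:
        raise ValueError(f"Unsupported evaluation hyperparameters: {unknown}")
    for name, value in overrides.items():
        setattr(hyperparameters, name, value)
    return hyperparameters.to_dict()


# Usage with the SDK installed (not executed here):
#   apply_overrides(evaluator.hyperparameters, {"sampling_temperature": 0.7, "top_p": 0.9})
```

The values `0.7` and `0.9` are placeholders; valid ranges come from the hub and are not listed in this reference.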
- kms_key_id: str | None#
- static list_bedrock_agentcore_runtimes(session=None) → list[source]#
List Bedrock AgentCore runtimes.
- Parameters:
session – Optional boto3 session.
- Returns:
List of dicts, each with keys name, runtime_id, arn, and status.
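The returned list of dicts can be searched for a runtime ARN to pass as `agent_config`. A minimal sketch, assuming only the four documented keys; `find_runtime_arn` is a hypothetical helper, and the status string `"READY"` used in the usage comment is an assumed value that this reference does not confirm.

```python
from typing import Dict, List, Optional


def find_runtime_arn(
    runtimes: List[Dict[str, str]],
    name: str,
    status: Optional[str] = None,
) -> Optional[str]:
    """Return the ARN of the first runtime matching `name` (and `status`, if given)."""
    for runtime in runtimes:
        if runtime.get("name") != name:
            continue
        if status is not None and runtime.get("status") != status:
            continue
        return runtime.get("arn")
    return None


# Usage with the SDK installed (not executed here):
#   runtimes = MultiTurnRLEvaluator.list_bedrock_agentcore_runtimes()
#   agent_arn = find_runtime_arn(runtimes, "my-agent", status="READY")
```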
- static list_supported_models(session=None) → list[source]#
Return the list of models that support MTRL evaluation.
Queries SageMakerPublicHub to discover all models with MTRL recipes in their RecipeCollection.
- Parameters:
session – Optional boto3 session.
- Returns:
List of hub content model names supporting MTRL evaluation.
- mlflow_experiment_name: str | None#
- mlflow_resource_arn: str | None#
- mlflow_run_name: str | None#
- model: str | BaseTrainer | AgentRFTJob | ModelPackage#
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_package_group: str | ModelPackageGroup | None#
- model_post_init(context: Any, /) → None#
This function is meant to behave like a BaseModel method to initialize private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- region: str | None#
- role: str | None#
- s3_output_path: str#
- sagemaker_session: Any | None#
- stopping_condition: int#
- tags: List[Dict[str, str]] | None#