Model Evaluation

Launch evaluation jobs with the following options:

  • LLM as a Judge (LLMAJ) Evaluation - Use large language models to assess model outputs

  • InspectAI Evaluation - Run open-source InspectAI benchmark tasks on SageMaker infrastructure

  • Custom Scorer Evaluation - Apply previously defined evaluator functions

  • Benchmark Evaluation - Run standardized performance benchmarks

  • Multi-Turn RL (Agentic) Evaluation - Evaluate multi-turn agent models with rollout-based metrics (pass@k, mean reward)
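
The rollout-based metrics named in the last option can be illustrated with a small, self-contained sketch. This page does not specify how the evaluation job computes pass@k, so the code below assumes the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), where n is the number of rollouts sampled for a task and c is the number that succeeded. The function names `pass_at_k` and `mean_reward` are hypothetical and are not part of any SageMaker API.

```python
from math import comb
from statistics import mean
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k rollouts succeeds.

    Assumed estimator (not confirmed by this page): 1 - C(n - c, k) / C(n, k),
    with n total rollouts for the task and c successful rollouts.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failures, every size-k subset contains a success.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_reward(rewards: Sequence[float]) -> float:
    """Average the total reward obtained across rollouts."""
    return mean(rewards)


if __name__ == "__main__":
    # Hypothetical data: 10 rollouts for one task, 3 of which succeeded.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
    print(mean_reward([1.0, 0.0, 0.5]))
```

To report a single pass@k figure for a whole evaluation set, one common approach is to compute the estimate per task and then average across tasks.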