Deploy Models for Inference

Deploy Models for Inference#

SageMaker Python SDK V3 transforms model deployment and inference with the unified ModelBuilder class, replacing the complex framework-specific model classes from V2. This modern approach provides a consistent interface for all inference scenarios while maintaining the flexibility and performance you need.

Key Benefits of V3 Inference#

Unified Interface: Single ModelBuilder class replaces multiple framework-specific model classes
Simplified Deployment: Object-oriented API with intelligent defaults for endpoint configuration
Enhanced Performance: Optimized inference pipelines with automatic scaling and load balancing
Multi-Modal Support: Deploy models for real-time, batch, and serverless inference scenarios

Quick Start Example#

Here’s how inference has evolved from V2 to V3:

SageMaker Python SDK V2:

from sagemaker.model import Model
from sagemaker.predictor import Predictor

model = Model(
    image_uri="my-inference-image",
    model_data="s3://my-bucket/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole"
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge"
)
result = predictor.predict(data)

SageMaker Python SDK V3:

from sagemaker.serve import ModelBuilder

model_builder = ModelBuilder(
    model="my-model",
    model_path="s3://my-bucket/model.tar.gz"
)

model = model_builder.build(model_name="my-deployed-model")

endpoint = model_builder.deploy(
    endpoint_name="my-endpoint",
    instance_type="ml.m5.xlarge",
    initial_instance_count=1
)

result = endpoint.invoke(
    body=data,
    content_type="application/json"
)

ModelBuilder Overview#

The ModelBuilder class is the cornerstone of SageMaker Python SDK V3 inference, providing a unified interface for all deployment scenarios. This single class replaces the complex web of framework-specific model classes from V2, offering:

Unified Deployment Interface: One class handles PyTorch, TensorFlow, Scikit-learn, XGBoost, HuggingFace, and custom containers
Intelligent Optimization: Automatically optimizes model serving configuration based on your model characteristics
Flexible Deployment Options: Support for real-time endpoints, batch transform, and serverless inference
Seamless Integration: Works seamlessly with SageMaker features like auto-scaling, multi-model endpoints, and A/B testing

from sagemaker.serve import ModelBuilder

model_builder = ModelBuilder(
    model="your-model",
    model_path="s3://your-bucket/model-artifacts",
    role="your-sagemaker-role"
)

model = model_builder.build(model_name="my-model")

endpoint = model_builder.deploy(
    endpoint_name="my-endpoint",
    instance_type="ml.m5.xlarge",
    initial_instance_count=1
)

response = endpoint.invoke(
    body={"inputs": "your-input-data"},
    content_type="application/json"
)

Inference Capabilities#

Model Optimization Support#

V3 introduces powerful model optimization capabilities for enhanced performance:

SageMaker Neo - Optimize models for specific hardware targets
TensorRT Integration - Accelerate deep learning inference on NVIDIA GPUs
ONNX Runtime - Cross-platform model optimization and acceleration
Quantization Support - Reduce model size and improve inference speed

Model Optimization Example:

from sagemaker.serve import ModelBuilder

# Create ModelBuilder with optimization settings
model_builder = ModelBuilder(
    model="huggingface-bert-base",
    role="your-sagemaker-role"
)

# Build and deploy with optimization
model = model_builder.build(model_name="optimized-bert")
endpoint = model_builder.deploy(
    endpoint_name="bert-endpoint",
    instance_type="ml.inf1.xlarge",
    initial_instance_count=1
)

Key Inference Features#

Multi-Model Endpoints - Host multiple models on a single endpoint with automatic model loading and unloading for cost optimization
Auto-Scaling Integration - Automatically scale endpoint capacity based on traffic patterns with configurable scaling policies
A/B Testing Support - Deploy multiple model variants with traffic splitting for safe model updates and performance comparison
Batch Transform Jobs - Process large datasets efficiently with automatic data partitioning and parallel processing
Serverless Inference - Pay-per-request pricing with automatic scaling from zero to handle variable workloads

Supported Inference Scenarios#

Deployment Types#

Real-Time Endpoints - Low-latency inference for interactive applications
Batch Transform - High-throughput processing for large datasets
Serverless Inference - Cost-effective inference for variable workloads
Multi-Model Endpoints - Host multiple models on shared infrastructure

Framework Support#

PyTorch - Deep learning models with dynamic computation graphs
TensorFlow - Production-ready machine learning models at scale
Scikit-learn - Classical machine learning algorithms
XGBoost - Gradient boosting models for structured data
HuggingFace - Pre-trained transformer models for NLP tasks
Custom Containers - Bring your own inference logic and dependencies

Advanced Features#

Model Monitoring - Track model performance and data drift in production
Endpoint Security - VPC support, encryption, and IAM-based access control
Multi-AZ Deployment - High availability with automatic failover
Custom Inference Logic - Implement preprocessing, postprocessing, and custom prediction logic

Migration from V2#

If you’re migrating from V2, the key changes are:

Replace framework-specific model classes (PyTorchModel, TensorFlowModel, etc.) with ModelBuilder
Use structured configuration objects instead of parameter dictionaries
Leverage the new invoke() method instead of predict() for more consistent API
Take advantage of built-in optimization and auto-scaling features

Inference Examples#

Explore comprehensive inference examples that demonstrate V3 capabilities: