Getting Started with Quota Management#
Quota management enables administrators to efficiently allocate shared compute resources between teams and projects by defining compute quotas and strategies for sharing capacity between quota shares. Each quota share operates as a virtual queue. When scheduling jobs for a job queue, AWS Batch will iterate through all attached quota shares to dispatch jobs that fit within their configured capacity and borrowing limits.
This notebook shows how to create quota management resources in AWS Batch for SageMaker Training jobs, and illustrates how the AWS Batch scheduler shares resources between quota shares, using preemption to reclaim lent capacity when the lending quota share's own jobs arrive.
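To make the "virtual queue" idea concrete, here is a minimal toy sketch of one scheduling pass over several quota shares. It is not the AWS Batch scheduler: `ToyShare` and `dispatch_round` are hypothetical names invented for illustration, and the model ignores borrowing, priorities, and preemption entirely.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyShare:
    """Hypothetical stand-in for a quota share: a FIFO queue plus a capacity."""
    name: str
    capacity: int                                   # instances reserved for this share
    pending: deque = field(default_factory=deque)   # jobs waiting in the virtual queue
    running: list = field(default_factory=list)     # jobs currently holding an instance


def dispatch_round(shares):
    """Visit every share once and start pending jobs that fit its own capacity."""
    started = []
    for share in shares:
        while share.pending and len(share.running) < share.capacity:
            job = share.pending.popleft()
            share.running.append(job)
            started.append((share.name, job))
    return started


shares = [
    ToyShare("QS1", capacity=1, pending=deque(["job-a", "job-b"])),
    ToyShare("QS2", capacity=1, pending=deque(["job-c"])),
]
started = dispatch_round(shares)
print(started)                    # [('QS1', 'job-a'), ('QS2', 'job-c')]
print(list(shares[0].pending))    # ['job-b'] stays queued: QS1 is at capacity
```

In this simplified model each share only ever uses its own capacity, so `job-b` waits even if another share is idle. Lending and borrowing, described later in this notebook, relax exactly that restriction.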
This notebook’s CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Setup and Configure Training Job Variables#
We will need a few instances for a short duration for the sample jobs. Change any of the constant variables below to adjust the example to your liking.
INSTANCE_TYPE = "ml.g5.xlarge"
INSTANCE_COUNT = 1
MAX_RUN_TIME = 300
TRAINING_JOB_NAME = "hello-world-simple-job"
import logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logging.getLogger("botocore.client").setLevel(level=logging.WARN)
logger = logging.getLogger(__name__)
from sagemaker.core.helper.session_helper import Session
from sagemaker.core import image_uris
session = Session()
image_uri = image_uris.retrieve(
    framework="pytorch",
    region=session.boto_session.region_name,
    version="2.5",
    instance_type=INSTANCE_TYPE,
    image_scope="training",
)
Create Sample Resources#
Here we create the AWS Batch service environment, job queue, and quota shares that we will use to enqueue our training jobs. Each quota share is configured with its own dedicated capacity limits, and may be configured to lend idle capacity to or borrow idle capacity from other quota shares. Cross-share preemption is always on, and allows a given quota share to take back any capacity it has lent to other quota shares when needed. In-share preemption can be enabled to allow high priority jobs to preempt low priority jobs within a given quota share.
QS1: configured with the LEND_AND_BORROW resource sharing strategy, and a borrow limit of 200%. This allows QS1 to both lend its own idle capacity and borrow idle capacity from any other quota share that is configured with a LEND or LEND_AND_BORROW resource sharing strategy. QS1 is also configured with in-share preemption, which allows jobs within QS1 to preempt each other based on priority.
QS2: configured with the LEND resource sharing strategy. This configuration allows QS2 to lend its own idle capacity but not borrow any other quota share’s idle capacity.
QS3: configured with the RESERVE resource sharing strategy. This configuration prevents QS3 from borrowing idle capacity from, and lending idle capacity to, other quota shares.
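The three sharing strategies can be sketched as a small toy model that computes how much idle capacity one share could borrow from the others at a given moment. This is an illustration only, not AWS Batch's implementation: `ToyQuotaShare`, `lendable_idle`, and `max_borrowable` are hypothetical names, and the sketch assumes the borrow limit is a percentage of the share's own capacity, which the notebook does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class ToyQuotaShare:
    """Hypothetical model of a quota share for illustration."""
    name: str
    capacity: int               # own capacity, in instances
    in_use: int                 # instances currently used by the share's own jobs
    strategy: str               # "RESERVE", "LEND", or "LEND_AND_BORROW"
    borrow_limit_pct: int = 0   # ASSUMPTION: percentage of the share's own capacity


def lendable_idle(share):
    """Idle capacity a share offers to others (always zero for RESERVE)."""
    if share.strategy in ("LEND", "LEND_AND_BORROW"):
        return max(share.capacity - share.in_use, 0)
    return 0


def max_borrowable(borrower, others):
    """Upper bound on what `borrower` could borrow right now in this toy model."""
    if borrower.strategy != "LEND_AND_BORROW":
        return 0
    limit = borrower.capacity * borrower.borrow_limit_pct // 100
    available = sum(lendable_idle(s) for s in others)
    return min(limit, available)


# Mirror the three shares described above; QS1 is busy, QS2 and QS3 are idle.
qs1 = ToyQuotaShare("QS1", capacity=1, in_use=1, strategy="LEND_AND_BORROW", borrow_limit_pct=200)
qs2 = ToyQuotaShare("QS2", capacity=1, in_use=0, strategy="LEND")
qs3 = ToyQuotaShare("QS3", capacity=1, in_use=0, strategy="RESERVE")

print(max_borrowable(qs1, [qs2, qs3]))  # 1: only QS2's idle instance is lendable
print(max_borrowable(qs2, [qs1, qs3]))  # 0: LEND shares never borrow
print(max_borrowable(qs3, [qs1, qs2]))  # 0: RESERVE shares neither lend nor borrow
```

Under these assumptions QS1's limit would permit borrowing two instances, but only one is actually on offer because QS3 keeps its idle capacity reserved.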
You can use the AWS Batch console to create these resources, or you can run the cell below. The create_quota_management_resources function skips any resources that already exist.
from sagemaker.train.aws_batch.boto_client import get_batch_boto_client
from utils.aws_batch_resource_management import AwsBatchResourceManager, QuotaShareConfig, create_quota_management_resources
SCHEDULING_POLICY_NAME = "my-qm-scheduling-policy"
JOB_QUEUE_NAME = "my-sm-training-qm-jq"
SERVICE_ENVIRONMENT_NAME = "my-sm-training-qm-se"
# Create SchedulingPolicy, ServiceEnvironment, JobQueue, and QuotaShares
resource_manager = AwsBatchResourceManager(get_batch_boto_client())
qs1 = QuotaShareConfig(
    name="QS1",
    capacity_unit=INSTANCE_TYPE,
    max_capacity=1,
    in_share_preemption=True,
    sharing_strategy="LEND_AND_BORROW",
    borrow_limit=200
)
qs2 = QuotaShareConfig(
    name="QS2",
    capacity_unit=INSTANCE_TYPE,
    max_capacity=1,
    in_share_preemption=False,
    sharing_strategy="LEND"
)
qs3 = QuotaShareConfig(
    name="QS3",
    capacity_unit=INSTANCE_TYPE,
    max_capacity=1,
    in_share_preemption=False,
    sharing_strategy="RESERVE"
)
resources = create_quota_management_resources(
    resource_manager=resource_manager,
    scheduling_policy_name=SCHEDULING_POLICY_NAME,
    job_queue_name=JOB_QUEUE_NAME,
    service_environment_name=SERVICE_ENVIRONMENT_NAME,
    capacity_unit=INSTANCE_TYPE,
    max_capacity=3,
    quota_share_configs=[qs1, qs2, qs3]
)
Create Hello World Model Trainer#
Now that our resources are created, we’ll construct a simple ModelTrainer. Any model trainer may be used; you may import your own instead of constructing a new one here if you wish.
from sagemaker.train.model_trainer import ModelTrainer
from sagemaker.train.configs import SourceCode, Compute, StoppingCondition
source_code = SourceCode(command="echo 'Hello World'")
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=TRAINING_JOB_NAME,
    compute=Compute(instance_type=INSTANCE_TYPE, instance_count=INSTANCE_COUNT),
    stopping_condition=StoppingCondition(max_runtime_in_seconds=MAX_RUN_TIME),
)
Create TrainingQueue object#
Using our queue is as easy as referring to it by name in the TrainingQueue constructor. The TrainingQueue class within the SageMaker Python SDK provides built-in support for working with Batch queues.
from sagemaker.train.aws_batch.training_queue import TrainingQueue, TrainingQueuedJob
# Construct the queue object using the SageMaker Python SDK
queue = TrainingQueue(JOB_QUEUE_NAME)
logger.info(f"Using queue: {queue.queue_name}")
Update job priority#
Raising the priority of qs1_job_low above that of the other jobs in QS1 triggers in-share job preemption. QS1 has in-share job preemption enabled, so the job that now has the highest priority preempts the lower priority job.
# Updating qs1_job_low to be the highest priority job in QS1
qs1_job_low.update(scheduling_priority=4)
logger.info(f"Updated job {qs1_job_low.job_name} to increase its priority")
logger.info("Waiting for jobs to transition...")
await_jobs([
    ([qs1_job_low], JobStatus.dispatched() | JobStatus.terminal()),
    # qs1_job_high is preempted by the now higher priority qs1_job_low and returns to RUNNABLE
    ([qs1_job_high], {JobStatus.RUNNABLE}),
])
list_jobs_by_quota_share(queue, [qs1.name, qs2.name, qs3.name], JobStatus.active())
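The effect of the priority update above can be sketched with a toy model of in-share preemption. This is a simplified illustration, not the AWS Batch scheduler: `ToyJob` and `apply_in_share_preemption` are hypothetical names, the starting priorities are made up, and the sketch assumes that a larger number means a higher priority, consistent with the update to priority 4 above.

```python
from dataclasses import dataclass


@dataclass
class ToyJob:
    """Hypothetical job record for illustration."""
    name: str
    priority: int   # ASSUMPTION: larger number = higher priority
    status: str     # "RUNNING" or "RUNNABLE"


def apply_in_share_preemption(jobs, capacity):
    """Keep the `capacity` highest priority jobs running; requeue the rest."""
    ranked = sorted(jobs, key=lambda job: job.priority, reverse=True)
    for index, job in enumerate(ranked):
        job.status = "RUNNING" if index < capacity else "RUNNABLE"
    return ranked


# Before the update: the high priority job holds the share's single instance.
low = ToyJob("qs1_job_low", priority=1, status="RUNNABLE")
high = ToyJob("qs1_job_high", priority=3, status="RUNNING")

# Raise the low job's priority, as in the update() call above, then reschedule.
low.priority = 4
apply_in_share_preemption([low, high], capacity=1)

print(low.status)    # RUNNING
print(high.status)   # RUNNABLE
```

The real scheduler dispatches and preempts asynchronously, which is why the notebook waits on job status transitions rather than assuming they happen immediately.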
Optional: Delete AWS Batch Resources#
This shows how to delete the AWS Batch ServiceEnvironment and JobQueue. This step is completely optional; uncomment the code below to delete the resources created a few steps above.
from utils.aws_batch_resource_management import delete_resources
# delete_resources(resource_manager, resources)
Notebook CI Test Results#
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.