K-means

The Amazon SageMaker K-means algorithm.

class sagemaker.KMeans(role, train_instance_count, train_instance_type, k, init_method=None, max_iterations=None, tol=None, num_trials=None, local_init_method=None, half_life_time_size=None, epochs=None, center_factor=None, eval_metrics=None, **kwargs)

Bases: sagemaker.amazon.amazon_estimator.AmazonAlgorithmEstimatorBase

A k-means clustering AmazonAlgorithmEstimatorBase. Finds k clusters of data in an unlabeled dataset.

This Estimator may be fit via calls to fit_ndarray() or fit(). The former allows a KMeans model to be fit on a 2-dimensional numpy array. The latter requires Amazon Record protobuf serialized data to be stored in S3.

To learn more about the Amazon protobuf Record class and how to prepare bulk data in this format, please consult AWS technical documentation: https://alpha-docs-aws.amazon.com/sagemaker/latest/dg/cdf-training.html

After this Estimator is fit, model data is stored in S3. The model may be deployed to an Amazon SageMaker Endpoint by invoking deploy(). As well as deploying an Endpoint, deploy() returns a KMeansPredictor object that can be used to perform k-means cluster assignments, using the trained k-means model hosted in the SageMaker Endpoint.

KMeans Estimators can be configured by setting hyperparameters. The available hyperparameters for KMeans are documented below. For further information on the AWS KMeans algorithm, please consult AWS technical documentation: https://alpha-docs-aws.amazon.com/sagemaker/latest/dg/k-means.html

Parameters:

role (str) – An AWS IAM role (either name or full ARN). The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might also use the IAM role if it needs to access AWS resources. For more information, see <link>???.
train_instance_count (int) – Number of Amazon EC2 instances to use for training.
train_instance_type (str) – Type of EC2 instance to use for training, for example, ‘ml.c4.xlarge’.
k (int) – The number of clusters to produce.
init_method (str) – How to initialize cluster locations. One of ‘random’ or ‘kmeans++’.
max_iterations (int) – Maximum number of iterations of Lloyd's EM procedure in the local k-means run used in the finalize stage.
tol (float) – Tolerance for the change in ssd, used for early stopping in the local k-means run.
num_trials (int) – Number of times the local version is run. The run with the best loss is chosen.
local_init_method (str) – Initialization method for the local version. One of ‘random’ or ‘kmeans++’.
half_life_time_size (int) – Points can have a decayed weight. When a point is first observed, its weight in the computation of the cluster mean is 1. This weight decays exponentially as more points are observed. The exponent coefficient is chosen such that after half_life_time_size further points have been observed, the point's weight becomes 1/2. If set to 0, there is no decay.
epochs (int) – Number of passes done over the training data.
center_factor (int) – The algorithm creates k * center_factor centers as it runs and reduces the number of centers to k when finalizing.
eval_metrics (list) – JSON list of metric types used to report the score for the model. Allowed values are “msd” (mean squared error) and “ssd” (sum of squared distances). If test data is provided, the score is reported in terms of all requested metrics.
**kwargs – base class keyword argument values.
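
The decay described for half_life_time_size can be illustrated with a short calculation. This is a minimal sketch that assumes the weight follows 0.5 ** (n / half_life_time_size), which is the simplest exponential form consistent with the description above; the function name decayed_weight is hypothetical and is not part of the SDK, and the algorithm's internal formula may differ.

```python
def decayed_weight(points_seen_since: int, half_life_time_size: int) -> float:
    """Illustrative weight of a point after `points_seen_since` newer points.

    Assumption: pure exponential decay with a half-life measured in points.
    This is a sketch of the documented behaviour, not the algorithm's code.
    """
    if points_seen_since < 0 or half_life_time_size < 0:
        raise ValueError("both arguments must be non-negative")
    if half_life_time_size == 0:
        # Documented special case: 0 disables decay.
        return 1.0
    return 0.5 ** (points_seen_since / half_life_time_size)


if __name__ == "__main__":
    for n in (0, 50, 100, 200):
        print(n, decayed_weight(n, half_life_time_size=100))
```

Under this assumption, a point's weight is 1 when it is first observed, 1/2 after half_life_time_size further points, and 1/4 after twice that many.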

repo_name = 'kmeans'

repo_version = 1

classmethod attach(training_job_name, sagemaker_session=None, job_details=None)

Attach to an existing training job.

Create an Estimator bound to an existing training job. Each subclass is responsible for implementing _prepare_init_params_from_job_description(), because this method delegates to it the actual conversion of a training job description into the arguments that the class constructor expects. After attaching, if the training job has a Complete status, it can be deployed with deploy() to create a SageMaker Endpoint and return a Predictor.

If the training job is in progress, attach will block and display log messages from the training job, until the training job completes.

Parameters:

training_job_name (str) – The name of the training job to attach to.
sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one using the default AWS configuration chain.

Examples

>>> my_estimator.fit(wait=False)
>>> training_job_name = my_estimator.latest_training_job.name
>>> # Later on:
>>> attached_estimator = Estimator.attach(training_job_name)
>>> attached_estimator.deploy()

Returns:	Instance of the calling `Estimator` Class with the attached training job.
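
The delegation that attach() relies on can be sketched as a small template-method pattern. Everything in this sketch is hypothetical: BaseEstimator and ToyKMeans are stand-ins, the job_details dictionary is a simplified placeholder for a training job description, and no AWS calls are made. It only illustrates how a base class can leave the conversion step to each subclass and is not the SDK's implementation.

```python
class BaseEstimator:
    """Hypothetical base class illustrating the attach() delegation."""

    def __init__(self, **init_params):
        self.init_params = init_params
        self.latest_training_job = None

    @classmethod
    def _prepare_init_params_from_job_description(cls, job_details):
        # Each subclass converts a job description into constructor arguments.
        raise NotImplementedError

    @classmethod
    def attach(cls, training_job_name, job_details):
        # Delegate the conversion, then build an instance of the calling class.
        init_params = cls._prepare_init_params_from_job_description(job_details)
        estimator = cls(**init_params)
        estimator.latest_training_job = training_job_name
        return estimator


class ToyKMeans(BaseEstimator):
    """Hypothetical subclass that knows how to read its own hyperparameters."""

    @classmethod
    def _prepare_init_params_from_job_description(cls, job_details):
        hyperparameters = job_details["HyperParameters"]
        # Assumption: hyperparameters arrive as strings and need casting.
        return {"k": int(hyperparameters["k"])}


attached = ToyKMeans.attach("my-job", {"HyperParameters": {"k": "10"}})
```

Because attach() is a classmethod that constructs cls, the returned object is an instance of whichever subclass it was called on, which matches the documented return value.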

data_location

delete_endpoint()

Delete an Amazon SageMaker Endpoint.

Raises:	`ValueError` – If the endpoint does not exist.

deploy(initial_instance_count, instance_type, endpoint_name=None, **kwargs)

Deploy the trained model to an Amazon SageMaker endpoint and return a sagemaker.RealTimePredictor object.

More information: http://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html

Parameters:

initial_instance_count (int) – Minimum number of EC2 instances to deploy to an endpoint for prediction.
instance_type (str) – Type of EC2 instance to deploy to an endpoint for prediction, for example, ‘ml.c4.xlarge’.
endpoint_name (str) – Name to use for creating an Amazon SageMaker endpoint. If not specified, the name of the training job is used.
**kwargs – Passed to invocation of create_model(). Implementations may customize create_model() to accept **kwargs to customize model creation during deploy. For more, see the implementation docs.

Returns:

A predictor that provides a predict() method, which can be used to send requests to the Amazon SageMaker endpoint and obtain inferences.

Return type:

sagemaker.predictor.RealTimePredictor

model_data: str – The model location in S3. Only set if Estimator has been fit().

record_set(train, labels=None, channel='train')

Build a RecordSet from a numpy ndarray matrix and label vector.

For the 2D ndarray train, each row is converted to a Record object. The vector is stored in the “values” entry of the features property of each Record. If labels is not None, each corresponding label is assigned to the “values” entry of the labels property of each Record.

The collection of Record objects is protobuf serialized and uploaded to new S3 locations. A manifest file listing the objects created is generated and also stored in S3.

The number of S3 objects created is controlled by the train_instance_count property on this Estimator. One S3 object is created per training instance.
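
The one-object-per-training-instance rule can be pictured as splitting the rows of the training array into train_instance_count groups. The sketch below uses numpy.array_split as a stand-in; the helper name split_into_shards is hypothetical, and the way record_set() actually partitions, serializes, and uploads the data is not shown here and may differ.

```python
import numpy as np


def split_into_shards(train: np.ndarray, num_shards: int) -> list:
    """Split the rows of a 2D array into `num_shards` nearly equal groups.

    Assumption: rows are divided contiguously. This only illustrates the
    idea of one data object per training instance.
    """
    if train.ndim != 2:
        raise ValueError("train must be a 2D array")
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return np.array_split(train, num_shards, axis=0)


# 10 rows of 2 features, split for a hypothetical train_instance_count of 3.
train = np.arange(20, dtype="float32").reshape(10, 2)
shards = split_into_shards(train, num_shards=3)
shard_sizes = [len(shard) for shard in shards]
```

Each shard would then correspond to one serialized object, with the manifest listing all of them.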

Parameters:

train (numpy.ndarray) – A 2D numpy array of training data.
labels (numpy.ndarray) – A 1D numpy array of labels. Its length must be equal to the number of rows in `train`.
channel (str) – The SageMaker TrainingJob channel this RecordSet should be assigned to.

Returns:	A RecordSet referencing the encoded, uploaded training and label data.
Return type:	RecordSet

train_image()

Return the Docker image to use for training.

The fit() method, which does the model training, calls this method to find the image to use for model training.

Returns:	The URI of the Docker image.
Return type:	str

eval_metrics: An algorithm hyperparameter with optional validation. Implemented as a python descriptor object.
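
For readers unfamiliar with descriptors, the following is a minimal sketch of how a hyperparameter with optional validation could be built on Python's descriptor protocol. The Hyperparameter and ToyEstimator classes are hypothetical and written only for illustration; the SDK's own descriptor class, its name, and its validation rules are not reproduced here. The allowed values "msd" and "ssd" are taken from the eval_metrics description above.

```python
class Hyperparameter:
    """Hypothetical data descriptor that validates values on assignment."""

    def __init__(self, validate=None, description=""):
        self.validate = validate        # optional callable returning bool
        self.description = description  # used in the error message

    def __set_name__(self, owner, name):
        # Called automatically when the owning class is created.
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__.get(self.name)

    def __set__(self, instance, value):
        if self.validate is not None and not self.validate(value):
            raise ValueError(
                f"Invalid value {value!r} for {self.name}: "
                f"expected {self.description}"
            )
        instance.__dict__[self.name] = value


class ToyEstimator:
    eval_metrics = Hyperparameter(
        validate=lambda metrics: all(m in ("msd", "ssd") for m in metrics),
        description='a list containing only "msd" and/or "ssd"',
    )

    def __init__(self, eval_metrics):
        # Assignment goes through Hyperparameter.__set__ and is validated.
        self.eval_metrics = eval_metrics


estimator = ToyEstimator(eval_metrics=["msd", "ssd"])
```

Validating in __set__ means an invalid value is rejected at assignment time rather than when a training job is submitted.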

create_model(): Return a KMeansModel referencing the latest S3 model data produced by this Estimator.

fit(records, mini_batch_size=5000, **kwargs)

Fit this Estimator on serialized Record objects, stored in S3.

records should be an instance of RecordSet. This defines a collection of S3 data files to train this Estimator on.

Training data is expected to be encoded as dense or sparse vectors in the “values” feature on each Record. If the data is labeled, the label is expected to be encoded as a list of scalars in the “values” feature of the Record label.

More information on the Amazon Record format is available at: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html

See record_set() to construct a RecordSet object from ndarray arrays.

Parameters:

records (`RecordSet`) – The records to train this `Estimator` on.
mini_batch_size (int or None) – The size of each mini-batch to use when training. If None, a default value will be used.

hyperparameters(): Return the SageMaker hyperparameters for training this KMeans Estimator.

class sagemaker.KMeansModel(model_data, role, sagemaker_session=None)

Bases: sagemaker.model.Model

Reference KMeans S3 model data. Calling deploy() creates an Endpoint and returns a Predictor that performs k-means cluster assignment.

class sagemaker.KMeansPredictor(endpoint, sagemaker_session=None)

Bases: sagemaker.predictor.RealTimePredictor

Assigns input vectors to their closest cluster in a KMeans model.

The implementation of predict() in this RealTimePredictor requires a numpy ndarray as input. The array should contain the same number of columns as the feature-dimension of the data used to fit the model this Predictor performs inference on.

predict() returns a list of Record objects, one for each row in the input ndarray. The nearest cluster is stored in the closest_cluster key of the Record.label field.
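
Conceptually, assigning a vector to its closest cluster means finding the nearest cluster center. The sketch below shows that computation locally with numpy, assuming Euclidean distance; the function name assign_clusters and the sample centers are hypothetical, and the hosted model's actual computation and response format are not reproduced here.

```python
import numpy as np


def assign_clusters(points: np.ndarray, centers: np.ndarray):
    """Return the index of, and distance to, the nearest center per row.

    Assumption: Euclidean distance. The column count of `points` must match
    the feature dimension of `centers`, mirroring the requirement above.
    """
    if points.ndim != 2 or centers.ndim != 2:
        raise ValueError("points and centers must be 2D arrays")
    if points.shape[1] != centers.shape[1]:
        raise ValueError("points and centers must have the same number of columns")
    # Pairwise differences, shape (n_points, n_centers, n_features).
    diffs = points[:, np.newaxis, :] - centers[np.newaxis, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    closest = distances.argmin(axis=1)
    return closest, distances[np.arange(len(points)), closest]


centers = np.array([[0.0, 0.0], [10.0, 10.0]])
points = np.array([[1.0, 1.0], [9.0, 8.0], [0.0, -1.0]])
closest, distance = assign_clusters(points, centers)
```

Here the first and third rows fall nearest to the first center and the second row nearest to the second, so closest holds one cluster index per input row.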