The Amazon SageMaker PCA algorithm.
PCA(role, train_instance_count, train_instance_type, num_components, algorithm_mode=None, subtract_mean=None, extra_components=None, **kwargs)
A Principal Components Analysis (PCA)
This Estimator may be fit via calls to
fit(). The former allows a PCA model to be fit on a 2-dimensional numpy array. The latter requires Amazon
Record protobuf serialized data to be stored in S3.
To learn more about the Amazon protobuf Record class and how to prepare bulk data in this format, please consult AWS technical documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
After this Estimator is fit, model data is stored in S3. The model may be deployed to an Amazon SageMaker Endpoint by invoking
deploy(). As well as deploying an Endpoint, deploy returns a
PCAPredictor object that can be used to project input vectors onto the learned lower-dimensional representation, using the trained PCA model hosted in the SageMaker Endpoint.
PCA Estimators can be configured by setting hyperparameters. The available hyperparameters for PCA are documented below. For further information on the AWS PCA algorithm, please consult AWS technical documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/pca.html
This Estimator uses Amazon SageMaker PCA to perform training and host deployed models. To learn more about Amazon SageMaker PCA, please read: https://docs.aws.amazon.com/sagemaker/latest/dg/how-pca-works.html
- role (str) – An AWS IAM role (either name or full ARN). The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role if it needs to access AWS resources.
- train_instance_count (int) – Number of Amazon EC2 instances to use for training.
- train_instance_type (str) – Type of EC2 instance to use for training, for example, ‘ml.c4.xlarge’.
- num_components (int) – The number of principal components. Must be greater than zero.
- algorithm_mode (str) – Mode for computing the principal components. One of ‘regular’, ‘stable’ or ‘randomized’.
- subtract_mean (bool) – Whether the data should be unbiased (have its mean subtracted) both during training and at inference.
- extra_components (int) – As the value grows larger, the solution becomes more accurate but the runtime and memory consumption increase linearly. If this value is unset, then a default value equal to the maximum of 10 and num_components will be used. Valid for randomized mode only.
- **kwargs – base class keyword argument values.
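The constraints in the parameter list above can be restated as a small validation helper. This is only a sketch: `resolve_pca_hyperparameters` is a hypothetical name and is not part of the SageMaker SDK. It assumes nothing beyond what the parameter descriptions say (a positive num_components, one of the three listed modes, and the documented default for extra_components in randomized mode).

```python
# Hypothetical helper, not part of the SageMaker SDK: it only restates
# the hyperparameter rules from the parameter list above.
VALID_MODES = ("regular", "stable", "randomized")


def resolve_pca_hyperparameters(num_components, algorithm_mode=None,
                                subtract_mean=None, extra_components=None):
    """Validate PCA hyperparameters and fill in the documented default."""
    if not isinstance(num_components, int) or num_components <= 0:
        raise ValueError("num_components must be an integer greater than zero")
    if algorithm_mode is not None and algorithm_mode not in VALID_MODES:
        raise ValueError("algorithm_mode must be one of %s" % (VALID_MODES,))
    if extra_components is not None and algorithm_mode != "randomized":
        raise ValueError("extra_components is valid for randomized mode only")

    resolved = {"num_components": num_components}
    if algorithm_mode is not None:
        resolved["algorithm_mode"] = algorithm_mode
    if subtract_mean is not None:
        resolved["subtract_mean"] = subtract_mean
    if algorithm_mode == "randomized":
        # Documented default when unset: the maximum of 10 and num_components.
        if extra_components is None:
            extra_components = max(10, num_components)
        resolved["extra_components"] = extra_components
    return resolved
```

For example, `resolve_pca_hyperparameters(25, algorithm_mode="randomized")` would resolve extra_components to 25, while the same call with num_components=3 would resolve it to 10.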
PCAModel referencing the latest S3 model data produced by this Estimator.
Delete an Amazon SageMaker
ValueError – If the endpoint does not exist.
deploy(initial_instance_count, instance_type, endpoint_name=None, **kwargs)
Deploy the trained model to an Amazon SageMaker endpoint and return a
- initial_instance_count (int) – Minimum number of EC2 instances to deploy to an endpoint for prediction.
- instance_type (str) – Type of EC2 instance to deploy to an endpoint for prediction, for example, ‘ml.c4.xlarge’.
- endpoint_name (str) – Name to use for creating an Amazon SageMaker endpoint. If not specified, the name of the training job is used.
- **kwargs – Passed to invocation of
create_model(). Implementations may customize
**kwargs to customize model creation during deploy. For more, see the implementation docs.
- A predictor that provides a
which can be used to send requests to the Amazon SageMaker endpoint and obtain inferences.
fit(records, mini_batch_size=None, **kwargs)
Fit this Estimator on serialized Record objects, stored in S3.
records should be an instance of
RecordSet. This defines a collection of S3 data files to train this
Training data is expected to be encoded as dense or sparse vectors in the “values” feature on each Record. If the data is labeled, the label is expected to be encoded as a list of scalars in the “values” feature of the Record label.
More information on the Amazon Record format is available at: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
record_set() to construct a
str – The model location in S3. Only set if Estimator has been
record_set(train, labels=None, channel='train')
RecordSet from a numpy ndarray matrix and label vector.
For the 2D train array, each row is converted to a Record object. The vector is stored in the “values” entry of the features property of each Record. If labels is not None, each corresponding label is assigned to the “values” entry of the labels property of each Record.
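The row-to-Record mapping described above can be pictured with plain dictionaries. This is a simplified sketch, not the SDK's code: the real Record is a protobuf message, and `rows_to_records` and the dictionary layout are hypothetical stand-ins chosen only to mirror the “values” entries named in the text.

```python
# Plain-dict stand-in for Record objects; the real class is a protobuf
# message, so the layout here is an illustrative assumption.
def rows_to_records(train, labels=None):
    """Convert each row of a 2D sequence into one record-like dict."""
    if labels is not None and len(labels) != len(train):
        raise ValueError("labels must have one entry per row of train")
    records = []
    for i, row in enumerate(train):
        # The row vector goes into the "values" entry of the features.
        record = {"features": {"values": [float(v) for v in row]}}
        if labels is not None:
            # The matching label goes into the "values" entry of the labels.
            record["labels"] = {"values": [float(labels[i])]}
        records.append(record)
    return records
```

Calling `rows_to_records([[1, 2], [3, 4]], labels=[0, 1])` produces two records, one per row, each carrying its own label.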
The collection of Record objects is protobuf serialized and uploaded to new S3 locations. A manifest file listing the objects created is generated and also stored in S3.
The number of S3 objects created is controlled by the train_instance_count property on this Estimator. One S3 object is created per training instance.
- train (numpy.ndarray) – A 2D numpy array of training data.
- labels (numpy.ndarray) – A 1D numpy array of labels. Its length must be equal to the
number of rows in
- channel (str) – The SageMaker TrainingJob channel this RecordSet should be assigned to.
A RecordSet referencing the encoded, uploaded training and label data.
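The "one S3 object per training instance" rule for record_set() can be illustrated with a small sharding routine. How the SDK actually partitions rows is not stated here, so this sketch assumes contiguous, near-equal shards. `split_into_shards` is a hypothetical name, and no serialization or S3 upload is performed.

```python
# Illustrative only: assumes rows are split into contiguous, near-equal
# shards, one per training instance. The SDK's real strategy may differ.
def split_into_shards(rows, num_shards):
    """Split rows into num_shards contiguous parts of near-equal size."""
    if num_shards <= 0:
        raise ValueError("num_shards must be greater than zero")
    base, remainder = divmod(len(rows), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        # The first `remainder` shards take one extra row each.
        size = base + (1 if i < remainder else 0)
        shards.append(rows[start:start + size])
        start += size
    return shards
```

With ten rows and a train_instance_count of 3, this sketch yields shards of 4, 3, and 3 rows, which would correspond to three uploaded objects.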
PCAModel(model_data, role, sagemaker_session=None)
Reference PCA S3 model data. Calling deploy() creates an Endpoint and returns a Predictor that transforms vectors to a lower-dimensional representation.
Transforms input vectors to lower-dimensional representations.
The implementation of predict() in this RealTimePredictor requires a numpy ndarray as input. The array should contain the same number of columns as the feature-dimension of the data used to fit the model this Predictor performs inference on.
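The column-count requirement can be made concrete with a local sketch of the projection itself. This is an assumption-laden illustration, not the endpoint's code: `project`, `components`, and `mean` are hypothetical stand-ins for whatever the hosted model has learned, and plain lists stand in for the ndarray input.

```python
# Local illustration of a PCA projection; the hosted endpoint performs the
# real computation, and these names are hypothetical.
def project(rows, components, mean=None):
    """Project each row onto the given component vectors.

    rows:       list of input vectors, each of length d
    components: list of k component vectors, each of length d
    mean:       optional length-d vector subtracted before projecting
    """
    dim = len(components[0])
    projections = []
    for row in rows:
        # Each input row must match the feature dimension used in training.
        if len(row) != dim:
            raise ValueError("expected %d columns, got %d" % (dim, len(row)))
        if mean is not None:
            centered = [x - m for x, m in zip(row, mean)]
        else:
            centered = list(row)
        # One dot product per component gives the lower-dimensional vector.
        projections.append([sum(c * x for c, x in zip(comp, centered))
                            for comp in components])
    return projections
```

The result holds one projected vector per input row, and a row with the wrong number of columns is rejected before any arithmetic is attempted.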
predict() returns a list of Record objects, one for each row in the input ndarray. The lower-dimensional vector result is stored in the projection key of the