LDA
The Amazon SageMaker LDA algorithm.
class sagemaker.LDA(role, train_instance_type, num_topics, alpha0=None, max_restarts=None, max_iterations=None, tol=None, **kwargs)
Bases: sagemaker.amazon.amazon_estimator.AmazonAlgorithmEstimatorBase
Latent Dirichlet Allocation (LDA) is an Estimator used for unsupervised learning.
Amazon SageMaker Latent Dirichlet Allocation is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics.
This Estimator may be fit via calls to fit(). It requires Amazon Record protobuf serialized data to be stored in S3. There is a utility record_set() that can be used to upload data to S3 and create a RecordSet to be passed to the fit call.
To learn more about the Amazon protobuf Record class and how to prepare bulk data in this format, please consult AWS technical documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
After this Estimator is fit, model data is stored in S3. The model may be deployed to an Amazon SageMaker Endpoint by invoking deploy(). As well as deploying an Endpoint, deploy returns an LDAPredictor object that can be used for inference calls using the trained model hosted in the SageMaker Endpoint.
LDA Estimators can be configured by setting hyperparameters. The available hyperparameters for LDA are documented below. An example usage sketch follows the parameter list.
For further information on the AWS LDA algorithm, please consult AWS technical documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/lda.html
Parameters: - role (str) – An AWS IAM role (either name or full ARN). The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role, if accessing AWS resources.
- train_instance_type (str) – Type of EC2 instance to use for training, for example, ‘ml.c4.xlarge’.
- num_topics (int) – The number of topics for LDA to find within the data.
- alpha0 (float) – Optional. Initial guess for the concentration parameter.
- max_restarts (int) – Optional. The number of restarts to perform during the Alternating Least Squares (ALS) spectral decomposition phase of the algorithm.
- max_iterations (int) – Optional. The maximum number of iterations to perform during the ALS phase of the algorithm.
- tol (float) – Optional. Target error tolerance for the ALS phase of the algorithm.
- **kwargs – base class keyword argument values.
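For example, a typical train-deploy-predict flow with this estimator looks like the sketch below. The IAM role ARN, the word-count matrix documents, the instance types, and the mini-batch size are placeholder assumptions for illustration, not values taken from this documentation.
>>> import numpy as np
>>> import sagemaker
>>> session = sagemaker.Session()
>>> role = 'arn:aws:iam::123456789012:role/SageMakerRole'  # hypothetical IAM role
>>> lda = sagemaker.LDA(role=role, train_instance_type='ml.c4.xlarge',
...                     num_topics=10, sagemaker_session=session)
>>> # 'documents' is an assumed word-count matrix: one row per document,
>>> # one column per vocabulary word.
>>> documents = np.random.randint(0, 5, size=(100, 200)).astype('float32')
>>> records = lda.record_set(documents)       # serialize to protobuf Records and upload to S3
>>> lda.fit(records, mini_batch_size=20)      # mini-batch size chosen arbitrarily for illustration
>>> predictor = lda.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge')
>>> topic_mixtures = predictor.predict(documents[:5])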
Tip
You can find additional parameters for initializing this class at AmazonAlgorithmEstimatorBase and EstimatorBase.
repo_name = 'lda'
repo_version = 1
create_model(vpc_config_override='VPC_CONFIG_DEFAULT', **kwargs)
Return an LDAModel referencing the latest S3 model data produced by this Estimator.
Parameters: - vpc_config_override (dict[str, list[str]]) – Optional override for VpcConfig set on the model. Default: use subnets and security groups from this Estimator. * ‘Subnets’ (list[str]): List of subnet ids. * ‘SecurityGroupIds’ (list[str]): List of security group ids.
- **kwargs – Additional kwargs passed to the LDAModel constructor.
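As a sketch of the vpc_config_override format (the subnet and security group IDs are hypothetical placeholders, and lda is assumed to be an LDA Estimator that has already been fit):
>>> model = lda.create_model(vpc_config_override={
...     'Subnets': ['subnet-0123456789abcdef0'],
...     'SecurityGroupIds': ['sg-0123456789abcdef0']})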
classmethod attach(training_job_name, sagemaker_session=None, model_channel_name='model')
Attach to an existing training job.
Create an Estimator bound to an existing training job. Each subclass is responsible for implementing _prepare_init_params_from_job_description(), as this method delegates the actual conversion of a training job description to the arguments that the class constructor expects. After attaching, if the training job has a Complete status, it can be deploy()ed to create a SageMaker Endpoint and return a Predictor.
If the training job is in progress, attach will block and display log messages from the training job, until the training job completes.
Examples
>>> my_estimator.fit(wait=False)
>>> training_job_name = my_estimator.latest_training_job.name
Later on:
>>> attached_estimator = Estimator.attach(training_job_name)
>>> attached_estimator.deploy()
Parameters: - training_job_name (str) – The name of the training job to attach to.
- sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one using the default AWS configuration chain.
- model_channel_name (str) – Name of the channel where pre-trained model data will be downloaded (default: ‘model’). If no channel with the same name exists in the training job, this option will be ignored.
Returns: Instance of the calling Estimator class with the attached training job.
compile_model(target_instance_family, input_shape, output_path, framework=None, framework_version=None, compile_max_run=300, tags=None, **kwargs)
Compile a Neo model using the input model.
Parameters: - target_instance_family (str) – Identifies the device that you want to run your model after compilation, for example: ml_c5. For allowed strings see https://docs.aws.amazon.com/sagemaker/latest/dg/API_OutputConfig.html.
- input_shape (dict) – Specifies the name and shape of the expected inputs for your trained model in json dictionary form, for example: {‘data’:[1,3,1024,1024]}, or {‘var1’: [1,1,28,28], ‘var2’:[1,1,28,28]}
- output_path (str) – Specifies where to store the compiled model
- framework (str) – The framework that is used to train the original model. Allowed values: ‘mxnet’, ‘tensorflow’, ‘keras’, ‘pytorch’, ‘onnx’, ‘xgboost’
- framework_version (str) – The version of the framework
- compile_max_run (int) – Timeout in seconds for compilation (default: 3 * 60). After this amount of time Amazon SageMaker Neo terminates the compilation job regardless of its current status.
- tags (list[dict]) – List of tags for labeling a compilation job. For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
- **kwargs – Passed to invocation of create_model(). Implementations may customize create_model() to accept **kwargs to customize model creation during deploy. For more, see the implementation docs.
Returns: A SageMaker Model object. See Model() for full details.
data_location
Placeholder docstring
delete_endpoint()
Delete an Amazon SageMaker Endpoint.
Raises: botocore.exceptions.ClientError – If the endpoint does not exist.
deploy(initial_instance_count, instance_type, accelerator_type=None, endpoint_name=None, use_compiled_model=False, update_endpoint=False, wait=True, model_name=None, kms_key=None, data_capture_config=None, tags=None, **kwargs)
Deploy the trained model to an Amazon SageMaker endpoint and return a sagemaker.RealTimePredictor object.
More information: http://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html
Parameters: - initial_instance_count (int) – Minimum number of EC2 instances to deploy to an endpoint for prediction.
- instance_type (str) – Type of EC2 instance to deploy to an endpoint for prediction, for example, ‘ml.c4.xlarge’.
- accelerator_type (str) – Type of Elastic Inference accelerator to attach to an endpoint for model loading and inference, for example, ‘ml.eia1.medium’. If not specified, no Elastic Inference accelerator will be attached to the endpoint. For more information: https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html
- endpoint_name (str) – Name to use for creating an Amazon SageMaker endpoint. If not specified, the name of the training job is used.
- use_compiled_model (bool) – Flag to select whether to use compiled (optimized) model. Default: False.
- update_endpoint (bool) – Flag to update the model in an existing Amazon SageMaker endpoint. If True, this will deploy a new EndpointConfig to an already existing endpoint and delete resources corresponding to the previous EndpointConfig. Default: False
- wait (bool) – Whether the call should wait until the deployment of model completes (default: True).
- model_name (str) – Name to use for creating an Amazon SageMaker model. If not specified, the name of the training job is used.
- kms_key (str) – The ARN of the KMS key that is used to encrypt the data on the storage volume attached to the instance hosting the endpoint.
- data_capture_config (sagemaker.model_monitor.DataCaptureConfig) – Specifies configuration related to Endpoint data capture for use with Amazon SageMaker Model Monitoring. Default: None.
- tags (List[dict[str, str]]) – Optional. The list of tags to attach to this specific endpoint. Example: >>> tags = [{'Key': 'tagname', 'Value': 'tagvalue'}] For more information about tags, see https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.add_tags
- **kwargs – Passed to invocation of create_model(). Implementations may customize create_model() to accept **kwargs to customize model creation during deploy. For more, see the implementation docs.
Returns: A predictor that provides a predict() method, which can be used to send requests to the Amazon SageMaker endpoint and obtain inferences.
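For instance, deploying with an explicit endpoint name and tags could look like the sketch below, where lda is assumed to be a fit LDA Estimator and the endpoint name and tag values are placeholders; the endpoint is deleted afterwards with delete_endpoint() to avoid leaving resources running.
>>> predictor = lda.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge',
...                        endpoint_name='my-lda-endpoint',
...                        tags=[{'Key': 'project', 'Value': 'topic-modeling'}])
>>> # ... run inference against the endpoint ...
>>> lda.delete_endpoint()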
enable_network_isolation()
Return True if this Estimator can use network isolation when running.
Returns: Whether this Estimator can use network isolation or not. Return type: bool
fit(records, mini_batch_size=None, wait=True, logs=True, job_name=None, experiment_config=None)
Fit this Estimator on serialized Record objects, stored in S3.
records should be an instance of RecordSet. This defines a collection of S3 data files to train this Estimator on.
Training data is expected to be encoded as dense or sparse vectors in the “values” feature on each Record. If the data is labeled, the label is expected to be encoded as a list of scalars in the “values” feature of the Record label.
More information on the Amazon Record format is available at: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
See record_set() to construct a RecordSet object from ndarray arrays.
Parameters: - records (RecordSet) – The records to train this Estimator on.
- mini_batch_size (int or None) – The size of each mini-batch to use when training. If None, a default value will be used.
- wait (bool) – Whether the call should wait until the job completes (default: True).
- logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
- job_name (str) – Training job name. If not specified, the estimator generates a default job name, based on the training image name and current timestamp.
- experiment_config (dict[str, str]) – Experiment management configuration. Dictionary contains three optional keys, ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentName’ (default: None).
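A hedged usage sketch, continuing from the constructor example above (the mini-batch size and job name are illustrative assumptions):
>>> records = lda.record_set(documents)
>>> lda.fit(records, mini_batch_size=20, wait=False, job_name='lda-training-example')
>>> # Re-attach to the job later, e.g. from another process:
>>> attached = sagemaker.LDA.attach('lda-training-example')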
get_vpc_config(vpc_config_override='VPC_CONFIG_DEFAULT')
Returns the VpcConfig dict either from this Estimator’s subnets and security groups, or else validates and returns an optional override value.
Parameters: vpc_config_override –
hyperparameters()
Placeholder docstring
latest_job_debugger_artifacts_path()
Gets the path to the DebuggerHookConfig output artifacts.
Returns: An S3 path to the output artifacts. Return type: str
latest_job_tensorboard_artifacts_path()
Gets the path to the TensorBoardOutputConfig output artifacts.
Returns: An S3 path to the output artifacts. Return type: str
model_data
str – The model location in S3. Only set if Estimator has been fit().
prepare_workflow_for_training(records=None, mini_batch_size=None, job_name=None)
Calls _prepare_for_training. Used when setting up a workflow.
Parameters: - records (RecordSet) – The records to train this Estimator on.
- mini_batch_size (int or None) – The size of each mini-batch to use when training. If None, a default value will be used.
- job_name (str) – Name of the training job to be created. If not specified, one is generated, using the base name given to the constructor if applicable.
record_set(train, labels=None, channel='train', encrypt=False)
Build a RecordSet from a numpy ndarray matrix and label vector.
For the 2D ndarray train, each row is converted to a Record object. The vector is stored in the “values” entry of the features property of each Record. If labels is not None, each corresponding label is assigned to the “values” entry of the labels property of each Record.
The collection of Record objects is protobuf serialized and uploaded to new S3 locations. A manifest file is generated containing the list of objects created and also stored in S3.
The number of S3 objects created is controlled by the train_instance_count property on this Estimator. One S3 object is created per training instance.
Parameters: - train (numpy.ndarray) – A 2D numpy array of training data.
- labels (numpy.ndarray) – A 1D numpy array of labels. Its length must be equal to the number of rows in train.
- channel (str) – The SageMaker TrainingJob channel this RecordSet should be assigned to.
- encrypt (bool) – Specifies whether the objects uploaded to S3 are encrypted on the server side using AES-256 (default: False).
Returns: A RecordSet referencing the encoded, uploaded training and label data.
Return type: RecordSet
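A brief sketch, where documents is an assumed 2D word-count matrix and lda is the estimator from the constructor example; LDA is unsupervised, so no labels vector is passed:
>>> records = lda.record_set(documents, channel='train', encrypt=False)
>>> lda.fit(records, mini_batch_size=20)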
train_image()
Placeholder docstring
training_job_analytics
Return a TrainingJobAnalytics object for the current training job.
transformer(instance_count, instance_type, strategy=None, assemble_with=None, output_path=None, output_kms_key=None, accept=None, env=None, max_concurrent_transforms=None, max_payload=None, tags=None, role=None, volume_kms_key=None, vpc_config_override='VPC_CONFIG_DEFAULT')
Return a Transformer that uses a SageMaker Model based on the training job. It reuses the SageMaker Session and base job name used by the Estimator.
Parameters: - instance_count (int) – Number of EC2 instances to use.
- instance_type (str) – Type of EC2 instance to use, for example, ‘ml.c4.xlarge’.
- strategy (str) – The strategy used to decide how to batch records in a single request (default: None). Valid values: ‘MULTI_RECORD’ and ‘SINGLE_RECORD’.
- assemble_with (str) – How the output is assembled (default: None). Valid values: ‘Line’ or ‘None’.
- output_path (str) – S3 location for saving the transform result. If not specified, results are stored to a default bucket.
- output_kms_key (str) – Optional. KMS key ID for encrypting the transform output (default: None).
- accept (str) – The accept header passed by the client to the inference endpoint. If it is supported by the endpoint, it will be the format of the batch transform output.
- env (dict) – Environment variables to be set for use during the transform job (default: None).
- max_concurrent_transforms (int) – The maximum number of HTTP requests to be made to each individual transform container at one time.
- max_payload (int) – Maximum size of the payload in a single HTTP request to the container in MB.
- tags (list[dict]) – List of tags for labeling a transform job. If none specified, then the tags used for the training job are used for the transform job.
- role (str) – The ExecutionRoleArn IAM Role ARN for the Model, which is also used during transform jobs. If not specified, the role from the Estimator will be used.
- volume_kms_key (str) – Optional. KMS key ID for encrypting the volume attached to the ML compute instance (default: None).
- vpc_config_override (dict[str, list[str]]) – Optional override for the VpcConfig set on the model. Default: use subnets and security groups from this Estimator. * ‘Subnets’ (list[str]): List of subnet ids. * ‘SecurityGroupIds’ (list[str]): List of security group ids.
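A hedged batch transform sketch: the S3 URIs are placeholder assumptions, the content type and split type are assumptions appropriate for protobuf RecordIO input, and lda is a fit LDA Estimator.
>>> transformer = lda.transformer(instance_count=1, instance_type='ml.c4.xlarge',
...                               output_path='s3://my-bucket/lda-transform-output/')
>>> transformer.transform('s3://my-bucket/lda-transform-input/',
...                       content_type='application/x-recordio-protobuf',
...                       split_type='RecordIO')
>>> transformer.wait()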
class sagemaker.LDAModel(model_data, role, sagemaker_session=None, **kwargs)
Bases: sagemaker.model.Model
Reference LDA S3 model data. Calling deploy() creates an Endpoint and returns a Predictor that transforms vectors to a lower-dimensional representation.
Parameters: - model_data –
- role –
- sagemaker_session –
- **kwargs –
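A hedged sketch of hosting existing model artifacts directly, where the S3 URI, role, and session are placeholder assumptions:
>>> model = sagemaker.LDAModel(model_data='s3://my-bucket/lda-output/model.tar.gz',
...                            role=role, sagemaker_session=session)
>>> predictor = model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge')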
class sagemaker.LDAPredictor(endpoint, sagemaker_session=None)
Bases: sagemaker.predictor.RealTimePredictor
Transforms input vectors to lower-dimensional representations.
The implementation of predict() in this RealTimePredictor requires a numpy ndarray as input. The array should contain the same number of columns as the feature-dimension of the data used to fit the model this Predictor performs inference on.
predict() returns a list of Record objects, one for each row in the input ndarray. The lower-dimensional vector result is stored in the projection key of the Record.label field.
Parameters: - endpoint –
- sagemaker_session –
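A hedged inference sketch: the endpoint name is a placeholder, and the float32_tensor accessor is an assumption about the Record protobuf layout rather than something stated in this documentation.
>>> import numpy as np
>>> import sagemaker
>>> predictor = sagemaker.LDAPredictor('my-lda-endpoint')
>>> test_documents = np.random.randint(0, 5, size=(3, 200)).astype('float32')
>>> results = predictor.predict(test_documents)
>>> for record in results:
...     # assumed accessor for the lower-dimensional result described above
...     print(record.label['projection'].float32_tensor.values)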