Object2Vec¶

The Amazon SageMaker Object2Vec algorithm.

class sagemaker.Object2Vec(role, train_instance_count, train_instance_type, epochs, enc0_max_seq_len, enc0_vocab_size, enc_dim=None, mini_batch_size=None, early_stopping_patience=None, early_stopping_tolerance=None, dropout=None, weight_decay=None, bucket_width=None, num_classes=None, mlp_layers=None, mlp_dim=None, mlp_activation=None, output_layer=None, optimizer=None, learning_rate=None, negative_sampling_rate=None, comparator_list=None, tied_token_embedding_weight=None, token_embedding_storage_type=None, enc0_network=None, enc1_network=None, enc0_cnn_filter_width=None, enc1_cnn_filter_width=None, enc1_max_seq_len=None, enc0_token_embedding_dim=None, enc1_token_embedding_dim=None, enc1_vocab_size=None, enc0_layers=None, enc1_layers=None, enc0_freeze_pretrained_embedding=None, enc1_freeze_pretrained_embedding=None, **kwargs)¶

Bases: sagemaker.amazon.amazon_estimator.AmazonAlgorithmEstimatorBase

Object2Vec is an Estimator used for learning low-dimensional dense embeddings of high-dimensional objects.

This Estimator may be fit via calls to fit(). The utility method record_set() can be used to upload data to S3 and create a RecordSet to be passed to the fit() call.

After this Estimator is fit, model data is stored in S3. The model may be deployed to an Amazon SageMaker Endpoint by invoking deploy(). In addition to deploying an Endpoint, deploy() returns a RealTimePredictor object that can be used for inference calls against the trained model hosted in the SageMaker Endpoint.

Object2Vec Estimators can be configured by setting hyperparameters. The available hyperparameters for Object2Vec are documented below.

For further information on the AWS Object2Vec algorithm, please consult AWS technical documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/object2vec.html

Parameters:

role (str) – An AWS IAM role (either name or full ARN). The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role if it accesses AWS resources.
train_instance_count (int) – Number of Amazon EC2 instances to use for training.
train_instance_type (str) – Type of EC2 instance to use for training, for example, ‘ml.c4.xlarge’.
epochs (int) – Total number of epochs for SGD training
enc0_max_seq_len (int) – Maximum sequence length for encoder “enc0”
enc0_vocab_size (int) – Vocabulary size of tokens for encoder “enc0”
enc_dim (int) – Optional. Dimension of the output of the embedding layer
mini_batch_size (int) – Optional. Mini-batch size for SGD training
early_stopping_patience (int) – Optional. The allowed number of consecutive epochs without improvement before early stopping is applied
early_stopping_tolerance (float) – Optional. The value used to determine whether the algorithm has made improvement between two consecutive epochs for early stopping
dropout (float) – Optional. Dropout probability on network layers
weight_decay (float) – Optional. Weight decay parameter during optimization
bucket_width (int) – Optional. The allowed difference between data sequence lengths when bucketing is enabled
num_classes (int) – Optional. Number of classes for classification training (ignored for regression problems)
mlp_layers (int) – Optional. Number of MLP layers in the network
mlp_dim (int) – Optional. Dimension of the output of MLP layer
mlp_activation (str) – Optional. Type of activation function for the MLP layer
output_layer (str) – Optional. Type of output layer
optimizer (str) – Optional. Type of optimizer for training
learning_rate (float) – Optional. Learning rate for SGD training
negative_sampling_rate (int) – Optional. Negative sampling rate
comparator_list (str) – Optional. Customization of comparator operator
tied_token_embedding_weight (bool) – Optional. Tying of token embedding layer weight
token_embedding_storage_type (str) – Optional. Type of token embedding storage
enc0_network (str) – Optional. Network model of encoder “enc0”
enc1_network (str) – Optional. Network model of encoder “enc1”
enc0_cnn_filter_width (int) – Optional. CNN filter width of encoder “enc0”
enc1_cnn_filter_width (int) – Optional. CNN filter width of encoder “enc1”
enc1_max_seq_len (int) – Optional. Maximum sequence length for encoder “enc1”
enc0_token_embedding_dim (int) – Optional. Output dimension of the token embedding layer of encoder “enc0”
enc1_token_embedding_dim (int) – Optional. Output dimension of the token embedding layer of encoder “enc1”
enc1_vocab_size (int) – Optional. Vocabulary size of tokens for encoder “enc1”
enc0_layers (int) – Optional. Number of layers in encoder “enc0”
enc1_layers (int) – Optional. Number of layers in encoder “enc1”
enc0_freeze_pretrained_embedding (bool) – Optional. Freeze pretrained embedding weights of encoder “enc0”
enc1_freeze_pretrained_embedding (bool) – Optional. Freeze pretrained embedding weights of encoder “enc1”
**kwargs – base class keyword argument values.
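
As a minimal sketch of how the required constructor arguments fit together, the helper below collects them into keyword arguments and imports the SDK only when the estimator is actually built. The role ARN, instance settings, and hyperparameter values are illustrative placeholders, and object2vec_kwargs and build_estimator are hypothetical helper names, not part of the SDK.

```python
def object2vec_kwargs(role, vocab_size, max_seq_len, epochs=10, **optional):
    """Collect constructor arguments for sagemaker.Object2Vec.

    Anything in `optional` (for example enc_dim or dropout) is passed
    through unchanged alongside the required arguments.
    """
    kwargs = {
        "role": role,
        "train_instance_count": 1,              # assumed: a single training instance
        "train_instance_type": "ml.c4.xlarge",  # example type from the parameter list
        "epochs": epochs,
        "enc0_max_seq_len": max_seq_len,
        "enc0_vocab_size": vocab_size,
    }
    kwargs.update(optional)
    return kwargs


def build_estimator(**kwargs):
    """Create the estimator; needs the SageMaker Python SDK and AWS credentials."""
    import sagemaker  # imported lazily so the helper above has no dependencies

    return sagemaker.Object2Vec(**kwargs)


params = object2vec_kwargs(
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",  # placeholder ARN
    vocab_size=45000,
    max_seq_len=50,
    enc0_network="bilstm",
    mlp_layers=2,
)
# estimator = build_estimator(**params)
```

Keeping the argument assembly separate from the SDK call makes it possible to inspect or log the configuration before any AWS resources are touched.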

repo_name = 'object2vec'¶

repo_version = 1¶

MINI_BATCH_SIZE = 32¶

classmethod attach(training_job_name, sagemaker_session=None, model_channel_name='model')¶

Attach to an existing training job.

Create an Estimator bound to an existing training job. Each subclass is responsible for implementing _prepare_init_params_from_job_description(), because this method delegates to it the conversion of a training job description into the arguments that the class constructor expects. After attaching, if the training job has a Complete status, it can be deployed with deploy() to create a SageMaker Endpoint and return a Predictor.

If the training job is in progress, attach will block and display log messages from the training job, until the training job completes.

Parameters:

training_job_name (str) – The name of the training job to attach to.
sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one using the default AWS configuration chain.
model_channel_name (str) – Name of the channel where pre-trained model data will be downloaded (default: ‘model’). If no channel with the same name exists in the training job, this option will be ignored.

Examples

>>> my_estimator.fit(wait=False)
>>> training_job_name = my_estimator.latest_training_job.name
>>> # Later on:
>>> attached_estimator = Estimator.attach(training_job_name)
>>> attached_estimator.deploy()

Returns:	Instance of the calling `Estimator` Class with the attached training job.

compile_model(target_instance_family, input_shape, output_path, framework=None, framework_version=None, compile_max_run=300, tags=None, **kwargs)¶

Compile a Neo model using the input model.

Parameters:

target_instance_family (str) – Identifies the device on which you want to run your model after compilation, for example: ml_c5. Allowed strings are: ml_c5, ml_m5, ml_c4, ml_m4, jetsontx1, jetsontx2, ml_p2, ml_p3, deeplens, rasp3b
input_shape (dict) – Specifies the name and shape of the expected inputs for your trained model in JSON dictionary form, for example: {‘data’:[1,3,1024,1024]}, or {‘var1’: [1,1,28,28], ‘var2’:[1,1,28,28]}
output_path (str) – Specifies where to store the compiled model
framework (str) – The framework that is used to train the original model. Allowed values: ‘mxnet’, ‘tensorflow’, ‘pytorch’, ‘onnx’, ‘xgboost’
framework_version (str) – The version of the framework
compile_max_run (int) – Timeout in seconds for compilation (default: 5 * 60). After this amount of time Amazon SageMaker Neo terminates the compilation job regardless of its current status.
tags (list[dict]) – List of tags for labeling a compilation job. For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
**kwargs – Passed to invocation of `create_model()`. Implementations may customize `create_model()` to accept `**kwargs` to customize model creation during deploy. For more, see the implementation docs.

Returns:	A SageMaker `Model` object. See `Model()` for full details.
Return type:	sagemaker.model.Model
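
The argument constraints above can be checked locally before a compilation job is requested. The following is a sketch only: compile_args is a hypothetical helper, the bucket and framework version are placeholders, and it does not check whether Neo actually supports compiling a given model.

```python
# Target families listed in the parameter description above.
ALLOWED_TARGETS = {
    "ml_c5", "ml_m5", "ml_c4", "ml_m4", "jetsontx1",
    "jetsontx2", "ml_p2", "ml_p3", "deeplens", "rasp3b",
}


def compile_args(target, input_shape, output_path, framework, framework_version):
    """Validate and assemble keyword arguments for estimator.compile_model()."""
    if target not in ALLOWED_TARGETS:
        raise ValueError("unsupported target_instance_family: %r" % target)
    for name, shape in input_shape.items():
        # Every dimension of every named input must be a positive integer.
        if not all(isinstance(dim, int) and dim > 0 for dim in shape):
            raise ValueError("input %r has an invalid dimension" % name)
    return {
        "target_instance_family": target,
        "input_shape": input_shape,
        "output_path": output_path,
        "framework": framework,
        "framework_version": framework_version,
    }


args = compile_args(
    target="ml_c5",
    input_shape={"data": [1, 3, 1024, 1024]},    # shape example from the docs above
    output_path="s3://example-bucket/compiled",  # placeholder bucket
    framework="mxnet",
    framework_version="1.2.1",                   # placeholder version
)
# compiled_model = estimator.compile_model(**args)
```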

data_location¶

delete_endpoint()¶

Delete an Amazon SageMaker Endpoint.

Raises:	`ValueError` – If the endpoint does not exist.

deploy(initial_instance_count, instance_type, accelerator_type=None, endpoint_name=None, use_compiled_model=False, update_endpoint=False, wait=True, **kwargs)¶

Deploy the trained model to an Amazon SageMaker endpoint and return a sagemaker.RealTimePredictor object.

More information: http://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html

Parameters:

initial_instance_count (int) – Minimum number of EC2 instances to deploy to an endpoint for prediction.
instance_type (str) – Type of EC2 instance to deploy to an endpoint for prediction, for example, ‘ml.c4.xlarge’.
accelerator_type (str) – Type of Elastic Inference accelerator to attach to an endpoint for model loading and inference, for example, ‘ml.eia1.medium’. If not specified, no Elastic Inference accelerator will be attached to the endpoint. For more information: https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html
endpoint_name (str) – Name to use for creating an Amazon SageMaker endpoint. If not specified, the name of the training job is used.
use_compiled_model (bool) – Flag to select whether to use compiled (optimized) model. Default: False.
update_endpoint (bool) – Flag to update the model in an existing Amazon SageMaker endpoint. If True, this will deploy a new EndpointConfig to an already existing endpoint and delete resources corresponding to the previous EndpointConfig. Default: False
tags (List[dict[str, str]]) – Optional. The list of tags to attach to this specific endpoint. Example: tags = [{‘Key’: ‘tagname’, ‘Value’: ‘tagvalue’}]. For more information about tags, see https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.add_tags
wait (bool) – Whether the call should wait until the deployment of model completes (default: True).
**kwargs – Passed to invocation of create_model(). Implementations may customize create_model() to accept **kwargs to customize model creation during deploy. For more, see the implementation docs.

Returns:

A predictor that provides a predict() method, which can be used to send requests to the Amazon SageMaker endpoint and obtain inferences.

Return type:

sagemaker.predictor.RealTimePredictor
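
Because a deployed endpoint keeps running until it is deleted, one possible pattern is to pair deploy() with delete_endpoint() in a context manager. This is a sketch under assumptions: temporary_endpoint is a hypothetical helper, it expects an already fitted estimator, and it fixes initial_instance_count at 1 for brevity.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(estimator, instance_type, **deploy_kwargs):
    """Deploy a fitted estimator and delete the endpoint when the block exits."""
    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        **deploy_kwargs
    )
    try:
        yield predictor
    finally:
        # Runs even if inference raises, so the endpoint does not linger.
        estimator.delete_endpoint()


# Usage with a fitted estimator (requires AWS credentials):
# with temporary_endpoint(estimator, "ml.c4.xlarge",
#                         tags=[{"Key": "tagname", "Value": "tagvalue"}]) as predictor:
#     result = predictor.predict(payload)  # payload format depends on the model
```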

enable_network_isolation()¶

Return True if this Estimator will need network isolation to run.

Returns:	Whether this Estimator needs network isolation or not.
Return type:	bool

fit(records, mini_batch_size=None, wait=True, logs=True, job_name=None)¶

Fit this Estimator on serialized Record objects, stored in S3.

records should be an instance of RecordSet. This defines a collection of S3 data files to train this Estimator on.

Training data is expected to be encoded as dense or sparse vectors in the “values” feature on each Record. If the data is labeled, the label is expected to be encoded as a list of scalars in the “values” feature of the Record label.

More information on the Amazon Record format is available at: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html

See record_set() to construct a RecordSet object from ndarray arrays.

Parameters:

records (RecordSet) – The records to train this Estimator on
mini_batch_size (int or None) – The size of each mini-batch to use when training. If None, a default value will be used.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Training job name. If not specified, the estimator generates a default job name, based on the training image name and current timestamp.

get_vpc_config(vpc_config_override='VPC_CONFIG_DEFAULT')¶: Return a VpcConfig dict, either from this Estimator’s subnets and security groups or, if an override value is supplied, by validating and returning that override.

hyperparameters()¶

Return the hyperparameters as a dictionary to use for training.

The fit() method, which trains the model, calls this method to find the hyperparameters.

Returns:	The hyperparameters.
Return type:	dict[str, str]
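
The dict[str, str] return type means every hyperparameter value reaches the training job as a string. The sketch below illustrates that shape; stringify_hyperparameters is a hypothetical helper, and dropping unset (None) values is an assumption about how optional hyperparameters are treated, not a description of the SDK's internals.

```python
def stringify_hyperparameters(values):
    """Convert set hyperparameters to strings, skipping those left as None."""
    return {
        name: str(value)
        for name, value in values.items()
        if value is not None
    }


hp = stringify_hyperparameters({
    "epochs": 5,
    "dropout": None,          # unset, so omitted from the result
    "learning_rate": 0.001,
    "enc0_network": "bilstm",
})
```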

model_data¶: str – The model location in S3. Only set if Estimator has been fit().

record_set(train, labels=None, channel='train', encrypt=False)¶

Build a RecordSet from a numpy ndarray matrix and label vector.

For the 2D ndarray train, each row is converted to a Record object. The vector is stored in the “values” entry of the features property of each Record. If labels is not None, each corresponding label is assigned to the “values” entry of the labels property of each Record.

The collection of Record objects is protobuf serialized and uploaded to new S3 locations. A manifest file is generated containing the list of objects created and also stored in S3.

The number of S3 objects created is controlled by the train_instance_count property on this Estimator. One S3 object is created per training instance.

Parameters:

train (numpy.ndarray) – A 2D numpy array of training data.
labels (numpy.ndarray) – A 1D numpy array of labels. Its length must be equal to the number of rows in `train`.
channel (str) – The SageMaker TrainingJob channel this RecordSet should be assigned to.
encrypt (bool) – Specifies whether the objects uploaded to S3 are encrypted on the server side using AES-256 (default: `False`).

Returns:	A RecordSet referencing the encoded, uploaded training and label data.
Return type:	RecordSet
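
To illustrate the "one S3 object per training instance" rule, the following sketch divides rows into as many contiguous shards as there are instances. It is plain Python for illustration only: split_rows is a hypothetical helper, and the exact way the SDK partitions rows is an assumption here.

```python
def split_rows(rows, num_shards):
    """Split rows into num_shards contiguous, nearly equal-sized shards."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    base, extra = divmod(len(rows), num_shards)
    shards, start = [], 0
    for index in range(num_shards):
        # The first `extra` shards each take one additional row.
        size = base + (1 if index < extra else 0)
        shards.append(rows[start:start + size])
        start += size
    return shards


# Ten rows across train_instance_count=3 would yield three objects.
shards = split_rows(list(range(10)), 3)
```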

train_image()¶

Return the Docker image to use for training.

The fit() method, which does the model training, calls this method to find the image to use for model training.

Returns:	The URI of the Docker image.
Return type:	str

training_job_analytics¶: Return a TrainingJobAnalytics object for the current training job.

transformer(instance_count, instance_type, strategy=None, assemble_with=None, output_path=None, output_kms_key=None, accept=None, env=None, max_concurrent_transforms=None, max_payload=None, tags=None, role=None, volume_kms_key=None)¶

Return a Transformer that uses a SageMaker Model based on the training job. It reuses the SageMaker Session and base job name used by the Estimator.

Parameters:

instance_count (int) – Number of EC2 instances to use.
instance_type (str) – Type of EC2 instance to use, for example, ‘ml.c4.xlarge’.
strategy (str) – The strategy used to decide how to batch records in a single request (default: None). Valid values: ‘MULTI_RECORD’ and ‘SINGLE_RECORD’.
assemble_with (str) – How the output is assembled (default: None). Valid values: ‘Line’ or ‘None’.
output_path (str) – S3 location for saving the transform result. If not specified, results are stored to a default bucket.
output_kms_key (str) – Optional. KMS key ID for encrypting the transform output (default: None).
accept (str) – The content type accepted by the endpoint deployed during the transform job.
env (dict) – Environment variables to be set for use during the transform job (default: None).
max_concurrent_transforms (int) – The maximum number of HTTP requests to be made to each individual transform container at one time.
max_payload (int) – Maximum size of the payload in a single HTTP request to the container in MB.
tags (list[dict]) – List of tags for labeling a transform job. If none specified, then the tags used for the training job are used for the transform job.
role (str) – The ExecutionRoleArn IAM Role ARN for the Model, which is also used during transform jobs. If not specified, the role from the Estimator will be used.
volume_kms_key (str) – Optional. KMS key ID for encrypting the volume attached to the ML compute instance (default: None).
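
The valid values listed for strategy and assemble_with can be checked before a Transformer is requested. This is a sketch: transformer_args is a hypothetical helper and the S3 path is a placeholder.

```python
# Values taken from the parameter descriptions above; None means "not set".
VALID_STRATEGIES = {None, "MULTI_RECORD", "SINGLE_RECORD"}
VALID_ASSEMBLE_WITH = {None, "Line", "None"}


def transformer_args(instance_type, instance_count=1, strategy=None,
                     assemble_with=None, output_path=None):
    """Validate and assemble keyword arguments for estimator.transformer()."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError("invalid strategy: %r" % strategy)
    if assemble_with not in VALID_ASSEMBLE_WITH:
        raise ValueError("invalid assemble_with: %r" % assemble_with)
    return {
        "instance_count": instance_count,
        "instance_type": instance_type,
        "strategy": strategy,
        "assemble_with": assemble_with,
        "output_path": output_path,
    }


batch_args = transformer_args(
    "ml.c4.xlarge",
    strategy="MULTI_RECORD",
    assemble_with="Line",
    output_path="s3://example-bucket/transform-output",  # placeholder bucket
)
# transformer = estimator.transformer(**batch_args)
```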

negative_sampling_rate¶: An algorithm hyperparameter with optional validation. Implemented as a Python descriptor object.

comparator_list¶: An algorithm hyperparameter with optional validation. Implemented as a Python descriptor object.

tied_token_embedding_weight¶: An algorithm hyperparameter with optional validation. Implemented as a Python descriptor object.

token_embedding_storage_type¶: An algorithm hyperparameter with optional validation. Implemented as a Python descriptor object.
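
For readers unfamiliar with descriptors, the following sketch shows how a hyperparameter with optional validation could be written as a data descriptor. It is not the SDK's implementation: the Hyperparameter and Example classes are hypothetical, and the non-negative rule for negative_sampling_rate is an assumption chosen for illustration.

```python
class Hyperparameter:
    """Data descriptor that converts and validates a value before storing it."""

    def __init__(self, name, validate=lambda value: True, message="", data_type=str):
        self.name = name
        self.validate = validate
        self.message = message
        self.data_type = data_type

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self  # accessed on the class, not an instance
        return obj.__dict__.setdefault("_hyperparameters", {}).get(self.name)

    def __set__(self, obj, value):
        # None means "unset"; any other value is converted, then validated.
        value = None if value is None else self.data_type(value)
        if value is not None and not self.validate(value):
            raise ValueError("Invalid %s: %r. %s" % (self.name, value, self.message))
        obj.__dict__.setdefault("_hyperparameters", {})[self.name] = value


class Example:
    negative_sampling_rate = Hyperparameter(
        "negative_sampling_rate",
        validate=lambda v: v >= 0,  # assumed rule, for illustration
        message="Expected a non-negative integer",
        data_type=int,
    )

    def __init__(self, negative_sampling_rate=None):
        self.negative_sampling_rate = negative_sampling_rate


model = Example(negative_sampling_rate="3")  # the string is converted to int
```

Because the check lives in __set__, it applies both in the constructor and on any later assignment to the attribute.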

create_model(vpc_config_override='VPC_CONFIG_DEFAULT')¶

Return an Object2VecModel referencing the latest S3 model data produced by this Estimator.

Parameters:	vpc_config_override (dict[str, list[str]]) – Optional override for VpcConfig set on the model. Default: use subnets and security groups from this Estimator.

* ‘Subnets’ (list[str]): List of subnet ids.
* ‘SecurityGroupIds’ (list[str]): List of security group ids.

class sagemaker.Object2VecModel(model_data, role, sagemaker_session=None, **kwargs)¶

Bases: sagemaker.model.Model

Reference Object2Vec S3 model data. Calling deploy() creates an Endpoint and returns a Predictor that performs inference with the trained Object2Vec model.