Feature Store APIs
Feature group
-
class sagemaker.feature_store.feature_group.FeatureGroup(name=NOTHING, sagemaker_session=<class 'sagemaker.session.Session'>, feature_definitions=NOTHING)
Bases: object
FeatureGroup definition.
This class instantiates a FeatureGroup object that comprises a name for the FeatureGroup, a session instance, and a list of feature definition objects (FeatureDefinition).
- Parameters
name (str) –
sagemaker_session (sagemaker.session.Session) –
feature_definitions (Sequence[sagemaker.feature_store.feature_definition.FeatureDefinition]) –
- Return type
-
feature_definitions
List of FeatureDefinitions.
- Type
Sequence[FeatureDefinition]
Method generated by attrs for class FeatureGroup.
-
create(s3_uri, record_identifier_name, event_time_feature_name, role_arn, online_store_kms_key_id=None, enable_online_store=False, offline_store_kms_key_id=None, disable_glue_table_creation=False, data_catalog_config=None, description=None, tags=None, table_format=None)
Create a SageMaker FeatureStore FeatureGroup.
- Parameters
s3_uri (Union[str, bool]) – S3 URI of the offline store; set to False to disable the offline store.
record_identifier_name (str) – name of the record identifier feature.
event_time_feature_name (str) – name of the event time feature.
role_arn (str) – ARN of the role used to call CreateFeatureGroup.
online_store_kms_key_id (str) – KMS key id for online store (default: None).
enable_online_store (bool) – whether to enable online store or not (default: False).
offline_store_kms_key_id (str) – KMS key id for offline store (default: None). If a KMS encryption key is not specified, SageMaker encrypts all data at rest using the default AWS KMS key. By defining your bucket-level key for SSE, you can reduce the cost of AWS KMS requests. For more information, see Bucket Key in the Amazon S3 User Guide.
disable_glue_table_creation (bool) – whether to turn off Glue table creation or not (default: False).
data_catalog_config (DataCatalogConfig) – configuration for Metadata store (default: None).
description (str) – description of the FeatureGroup (default: None).
tags (List[Dict[str, str]]) – list of tags for labeling a FeatureGroup (default: None).
table_format (TableFormatEnum) – format of the offline store table (default: None).
- Returns
Response dict from service.
- Return type
Dict[str, Any]
-
delete()
Delete a FeatureGroup.
-
describe(next_token=None)
Describe a FeatureGroup.
-
update(feature_additions)
Update a FeatureGroup and add new features from the given feature definitions.
-
update_feature_metadata(feature_name, description=None, parameter_additions=None, parameter_removals=None)
Update the metadata of a feature and add or remove metadata parameters.
- Parameters
- Returns
Response dict from service.
- Return type
Dict[str, Any]
-
describe_feature_metadata(feature_name)
Describe feature metadata by feature name.
-
get_record(record_identifier_value_as_string, feature_names=None)
Get a single record in a FeatureGroup.
-
put_record(record)
Put a single record in the FeatureGroup.
- Parameters
record (Sequence[FeatureValue]) – a list containing feature values.
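As a rough illustration of what such a record looks like, the sketch below converts one row into a list of name/value pairs with every value rendered as a string. `FeatureValueSketch` is a hypothetical stand-in that only mirrors the documented `FeatureValue(feature_name, value_as_string)` signature, and skipping `None` values is an assumption made for this example, not documented behaviour.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class FeatureValueSketch:
    """Stand-in mirroring the documented FeatureValue(feature_name, value_as_string)."""
    feature_name: Optional[str] = None
    value_as_string: Optional[str] = None


def row_to_record(row: Dict[str, Any]) -> List[FeatureValueSketch]:
    """Convert one row into a list of feature values.

    Assumption for this sketch: entries whose value is None are left out.
    """
    return [
        FeatureValueSketch(feature_name=name, value_as_string=str(value))
        for name, value in row.items()
        if value is not None
    ]


# Hypothetical feature names, used only for illustration.
record = row_to_record({"customer_id": 42, "balance": 10.5, "segment": None})
for feature_value in record:
    print(feature_value.feature_name, feature_value.value_as_string)
```

With the SDK installed, a list of real `FeatureValue` objects built the same way would be the argument passed to `put_record`.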
-
delete_record(record_identifier_value_as_string, event_time)
Delete a single record from a FeatureGroup.
- Parameters
record_identifier_value_as_string (String) – a string representing the value of the record identifier.
event_time (String) – a timestamp-formatted string indicating when the deletion event occurred.
-
ingest(data_frame, max_workers=1, max_processes=1, wait=True, timeout=None, profile_name=None)
Ingest the content of a pandas DataFrame to feature store.
max_workers is the number of threads created to work on different partitions of the data_frame in parallel.
max_processes is the number of processes created to work on different partitions of the data_frame in parallel, each with max_workers threads.
The ingest function attempts to ingest all records in the data frame. SageMaker Feature Store throws an exception if it fails to ingest any records.
If wait is True, Feature Store runs the ingest function synchronously. You receive an IngestionError if there are any records that can't be ingested. If wait is False, Feature Store runs the ingest function asynchronously.
Instead of setting wait to True in the ingest function, you can invoke the wait function on the returned instance of IngestionManagerPandas to run the ingest function synchronously.
To access the rows that failed to ingest, set wait to False. The IngestionError.failed_rows object saves all of the rows that failed to ingest.
The profile_name argument is optional. If None is passed, the default credentials are used. This profile_name is used in the sagemaker_featurestore_runtime client only. See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html for more about the default credentials.
- Parameters
data_frame (DataFrame) – data_frame to be ingested to feature store.
max_workers (int) – number of threads to be created.
max_processes (int) – number of processes to be created. Each process spawns max_workers threads.
wait (bool) – whether to wait for the ingestion to finish or not.
timeout (Union[int, float]) – concurrent.futures.TimeoutError will be raised if timeout is reached.
profile_name (str) – the profile credential to be used for PutRecord (default: None).
- Returns
An instance of IngestionManagerPandas.
- Return type
sagemaker.feature_store.feature_group.IngestionManagerPandas
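The partition-and-collect behaviour described for ingest can be pictured with a minimal, single-process sketch. It is an illustration under stated assumptions, not the SDK's implementation: `ingest_rows` and `fake_put_record` are hypothetical names, rows are plain dicts rather than a DataFrame, and the way rows are split into contiguous partitions is assumed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def ingest_rows(rows: Sequence[Dict[str, object]],
                put_record: Callable[[Dict[str, object]], None],
                max_workers: int = 1) -> List[int]:
    """Send every row through put_record and return the indices that failed."""
    # Split the row indices into one contiguous partition per worker thread.
    size = max(1, -(-len(rows) // max_workers))  # ceiling division
    partitions = [range(start, min(start + size, len(rows)))
                  for start in range(0, len(rows), size)]

    def ingest_partition(indices: range) -> List[int]:
        failed = []
        for index in indices:
            try:
                put_record(rows[index])
            except Exception:
                # Record the failure and keep going so every row is attempted.
                failed.append(index)
        return failed

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(ingest_partition, partitions))
    return sorted(index for failed in results for index in failed)


def fake_put_record(row: Dict[str, object]) -> None:
    """Hypothetical writer that rejects rows without an 'id' value."""
    if row.get("id") is None:
        raise ValueError("missing record identifier")


rows = [{"id": 1}, {"id": None}, {"id": 3}, {"id": None}, {"id": 5}]
failed_rows = ingest_rows(rows, fake_put_record, max_workers=2)
print(failed_rows)  # prints [1, 3]
```

The returned list plays the role that failed_rows plays in the text above: it names the rows that could not be written after every row has been attempted.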
-
athena_query()
Create an AthenaQuery instance.
- Returns
An instance of AthenaQuery initialized with data catalog configurations.
- Return type
-
as_hive_ddl(database='sagemaker_featurestore', table_name=None)
Generate Hive DDL commands to define or change the structure of tables or databases in Hive.
The schema of the table is generated from the feature definitions. Columns are named after the feature names, and data types are inferred from the feature types: the Integral feature type is mapped to the INT data type, the Fractional feature type to FLOAT, and the String feature type to STRING.
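The type mapping above can be sketched in a few lines. This is a simplified illustration only: `sketch_hive_ddl` is a hypothetical helper, the `CREATE EXTERNAL TABLE` wording is an assumption, and the statement the SDK actually generates may contain additional columns and clauses that are not shown here.

```python
from typing import List, Tuple

# Mapping described in the text: feature type -> Hive data type.
HIVE_TYPE_BY_FEATURE_TYPE = {
    "Integral": "INT",
    "Fractional": "FLOAT",
    "String": "STRING",
}


def sketch_hive_ddl(database: str, table_name: str,
                    feature_definitions: List[Tuple[str, str]]) -> str:
    """Build a simplified CREATE TABLE statement from (name, type) pairs."""
    columns = ",\n".join(
        f"  {name} {HIVE_TYPE_BY_FEATURE_TYPE[feature_type]}"
        for name, feature_type in feature_definitions
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table_name} (\n"
        f"{columns}\n)"
    )


ddl = sketch_hive_ddl(
    "sagemaker_featurestore",
    "customers",  # hypothetical table name
    [("customer_id", "Integral"), ("balance", "Fractional"), ("segment", "String")],
)
print(ddl)
```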
-
class sagemaker.feature_store.feature_group.AthenaQuery(catalog, database, table_name, sagemaker_session)
Bases: object
Class to manage querying of feature store data with AWS Athena.
This class instantiates an AthenaQuery object that is used to retrieve data from feature store via standard SQL queries.
- Parameters
catalog (str) –
database (str) –
table_name (str) –
sagemaker_session (sagemaker.session.Session) –
- Return type
Method generated by attrs for class AthenaQuery.
-
run(query_string, output_location, kms_key=None, workgroup=None)
Execute a SQL query given a query string, an output location, and a KMS key.
This method executes the SQL query using Athena, writes the results to output_location, and returns the execution ID of the query.
- Parameters
- Returns
Execution id of the query.
- Return type
-
wait()
Wait for the current query to finish.
-
get_query_execution()
Get execution status of the current query.
- Returns
Response dict from Athena.
- Return type
Dict[str, Any]
-
as_dataframe()
Download the result of the current query and load it into a DataFrame.
- Returns
A pandas DataFrame containing the query result.
- Return type
pandas.core.frame.DataFrame
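The three methods above are typically used in sequence: run, then wait, then as_dataframe. The sketch below shows that ordering against a hypothetical stand-in, `FakeAthenaQuery`, so that it runs without AWS access; only the method names and the `run` signature are taken from this page, and the table name and S3 location are made up.

```python
from typing import Any, List


class FakeAthenaQuery:
    """Hypothetical stand-in exposing the documented run/wait/as_dataframe methods."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def run(self, query_string, output_location, kms_key=None, workgroup=None):
        self.calls.append("run")
        return "fake-execution-id"

    def wait(self):
        self.calls.append("wait")

    def as_dataframe(self):
        self.calls.append("as_dataframe")
        # A real AthenaQuery returns a pandas DataFrame here.
        return [{"customer_id": 42}]


def run_query(athena_query: Any, query_string: str, output_location: str):
    """Run a query, block until it finishes, then load the result."""
    athena_query.run(query_string=query_string, output_location=output_location)
    athena_query.wait()
    return athena_query.as_dataframe()


query = FakeAthenaQuery()
result = run_query(
    query,
    'SELECT * FROM "customers"',            # hypothetical table name
    "s3://example-bucket/athena-results/",  # hypothetical output location
)
print(query.calls)
```

With the SDK, the object passed to `run_query` would be the instance returned by the athena_query method of a FeatureGroup.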
-
class sagemaker.feature_store.feature_group.IngestionManagerPandas(feature_group_name, sagemaker_fs_runtime_client_config=None, sagemaker_session=None, max_workers=1, max_processes=1, profile_name=None, async_result=None, processing_pool=None, failed_indices=NOTHING)
Bases: object
Class to manage the multi-threaded data ingestion process.
This class manages the data ingestion process, which is multi-threaded.
- Parameters
feature_group_name (str) –
sagemaker_fs_runtime_client_config (botocore.config.Config) –
sagemaker_session (sagemaker.session.Session) –
max_workers (int) –
max_processes (int) –
profile_name (str) –
async_result (multiprocessing.pool.ApplyResult) –
processing_pool (pathos.multiprocessing.ProcessPool) –
failed_indices (List[int]) –
- Return type
-
data_frame
pandas DataFrame to be ingested to the given feature group.
- Type
DataFrame
Method generated by attrs for class IngestionManagerPandas.
-
property failed_rows
Get rows that failed to ingest.
- Returns
List of row indices that failed to be ingested.
-
wait(timeout=None)
Wait for the ingestion process to finish.
-
run(data_frame, wait=True, timeout=None)
Start the ingestion process.
Feature definition
-
class sagemaker.feature_store.feature_definition.FeatureDefinition(feature_name, feature_type)
Bases: sagemaker.feature_store.inputs.Config
Feature definition.
This class instantiates a FeatureDefinition object; FeatureDefinition is a subclass of Config.
- Parameters
feature_name (str) –
feature_type (sagemaker.feature_store.feature_definition.FeatureTypeEnum) –
- Return type
-
feature_type
The type of the feature.
- Type
Method generated by attrs for class FeatureDefinition.
-
class sagemaker.feature_store.feature_definition.FractionalFeatureDefinition(feature_name)
Bases: sagemaker.feature_store.feature_definition.FeatureDefinition
Fractional feature definition.
This class instantiates a FractionalFeatureDefinition object, a subclass of FeatureDefinition where the data type of the feature being defined is Fractional.
-
feature_type
A FeatureTypeEnum.FRACTIONAL type.
- Type
Construct an instance of FractionalFeatureDefinition.
- Parameters
feature_name (str) – the name of the feature.
-
-
class sagemaker.feature_store.feature_definition.IntegralFeatureDefinition(feature_name)
Bases: sagemaker.feature_store.feature_definition.FeatureDefinition
Integral feature definition.
This class instantiates an IntegralFeatureDefinition object, a subclass of FeatureDefinition where the data type of the feature being defined is Integral.
-
feature_type
A FeatureTypeEnum.INTEGRAL type.
- Type
Construct an instance of IntegralFeatureDefinition.
- Parameters
feature_name (str) – the name of the feature.
-
-
class sagemaker.feature_store.feature_definition.StringFeatureDefinition(feature_name)
Bases: sagemaker.feature_store.feature_definition.FeatureDefinition
String feature definition.
This class instantiates a StringFeatureDefinition object, a subclass of FeatureDefinition where the data type of the feature being defined is String.
-
feature_type
A FeatureTypeEnum.STRING type.
- Type
Construct an instance of StringFeatureDefinition.
- Parameters
feature_name (str) – the name of the feature.
-
Inputs
-
class sagemaker.feature_store.inputs.Config
Bases: abc.ABC
Base config object for FeatureStore.
Configs must implement the to_dict method.
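The contract stated above (an abstract base whose subclasses must provide to_dict) can be sketched as follows. `ConfigSketch` and `S3StorageConfigSketch` are hypothetical stand-ins rather than the SDK classes, and the dictionary key names `S3Uri` and `KmsKeyId` are assumptions chosen for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class ConfigSketch(ABC):
    """Stand-in for a base config: subclasses must implement to_dict."""

    @abstractmethod
    def to_dict(self) -> Dict[str, Any]:
        """Return the config as a plain dictionary."""


class S3StorageConfigSketch(ConfigSketch):
    """Stand-in mirroring the documented S3StorageConfig(s3_uri, kms_key_id=None)."""

    def __init__(self, s3_uri: str, kms_key_id: Optional[str] = None) -> None:
        self.s3_uri = s3_uri
        self.kms_key_id = kms_key_id

    def to_dict(self) -> Dict[str, Any]:
        # Key names are assumed for this sketch; unset values are omitted.
        result: Dict[str, Any] = {"S3Uri": self.s3_uri}
        if self.kms_key_id is not None:
            result["KmsKeyId"] = self.kms_key_id
        return result


config = S3StorageConfigSketch("s3://example-bucket/offline-store")  # hypothetical URI
config_dict = config.to_dict()
print(config_dict)

# The abstract base itself cannot be instantiated.
try:
    ConfigSketch()
    base_is_abstract = False
except TypeError:
    base_is_abstract = True
```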
-
class sagemaker.feature_store.inputs.DataCatalogConfig(table_name=NOTHING, catalog=NOTHING, database=NOTHING)
Bases: sagemaker.feature_store.inputs.Config
DataCatalogConfig for FeatureStore.
Method generated by attrs for class DataCatalogConfig.
-
class sagemaker.feature_store.inputs.OfflineStoreConfig(s3_storage_config, disable_glue_table_creation=False, data_catalog_config=None, table_format=None)
Bases: sagemaker.feature_store.inputs.Config
OfflineStoreConfig for FeatureStore.
- Parameters
s3_storage_config (sagemaker.feature_store.inputs.S3StorageConfig) –
disable_glue_table_creation (bool) –
data_catalog_config (sagemaker.feature_store.inputs.DataCatalogConfig) –
table_format (sagemaker.feature_store.inputs.TableFormatEnum) –
- Return type
-
s3_storage_config
Configuration of S3 storage.
- Type
-
data_catalog_config
Configuration of the data catalog.
- Type
-
table_format
Format of the offline store table.
- Type
Method generated by attrs for class OfflineStoreConfig.
-
class sagemaker.feature_store.inputs.OnlineStoreConfig(enable_online_store=True, online_store_security_config=None)
Bases: sagemaker.feature_store.inputs.Config
OnlineStoreConfig for FeatureStore.
- Parameters
enable_online_store (bool) –
online_store_security_config (sagemaker.feature_store.inputs.OnlineStoreSecurityConfig) –
- Return type
-
online_store_security_config
Configuration of security settings.
Method generated by attrs for class OnlineStoreConfig.
-
class sagemaker.feature_store.inputs.OnlineStoreSecurityConfig(kms_key_id=NOTHING)
Bases: sagemaker.feature_store.inputs.Config
OnlineStoreSecurityConfig for FeatureStore.
Method generated by attrs for class OnlineStoreSecurityConfig.
-
class sagemaker.feature_store.inputs.S3StorageConfig(s3_uri, kms_key_id=None)
Bases: sagemaker.feature_store.inputs.Config
S3StorageConfig for FeatureStore.
Method generated by attrs for class S3StorageConfig.
-
class sagemaker.feature_store.inputs.FeatureValue(feature_name=None, value_as_string=None)
Bases: sagemaker.feature_store.inputs.Config
FeatureValue for FeatureStore.
Method generated by attrs for class FeatureValue.
Dataset Builder
-
class sagemaker.feature_store.dataset_builder.DatasetBuilder(sagemaker_session, base, output_path, record_identifier_feature_name=None, event_time_identifier_feature_name=None, included_feature_names=None, kms_key_id=None, event_time_identifier_feature_type=None)
Bases: object
DatasetBuilder definition.
This class instantiates a DatasetBuilder object that comprises a base, a list of feature names, an output path, and a KMS key ID.
- Parameters
sagemaker_session (sagemaker.session.Session) –
base (Union[sagemaker.feature_store.feature_group.FeatureGroup, pandas.core.frame.DataFrame]) –
output_path (str) –
record_identifier_feature_name (str) –
event_time_identifier_feature_name (str) –
included_feature_names (List[str]) –
kms_key_id (str) –
event_time_identifier_feature_type (sagemaker.feature_store.feature_definition.FeatureTypeEnum) –
- Return type
-
_base
A base, which can be either a FeatureGroup or a pandas.DataFrame, that is used to merge other FeatureGroups and generate a Dataset.
- Type
Union[FeatureGroup, DataFrame]
-
_record_identifier_feature_name
A string representing the record identifier feature if base is a DataFrame (default: None).
- Type
-
_event_time_identifier_feature_name
A string representing the event time identifier feature if base is a DataFrame (default: None).
- Type
-
_included_feature_names
A list of strings representing features to be included in the output (default: None).
- Type
List[str]
-
_kms_key_id
A KMS key ID. If set, it is used to encrypt the result file (default: None).
- Type
-
_point_in_time_accurate_join
A boolean indicating whether to use a point-in-time accurate join (default: False).
- Type
-
_include_duplicated_records
A boolean indicating whether to include duplicated records (default: False).
- Type
-
_include_deleted_records
A boolean indicating whether to include deleted records (default: False).
- Type
-
_number_of_recent_records
An int specifying how many records are returned for each record identifier (default: 1).
- Type
-
_write_time_ending_timestamp
A datetime before which every record in the dataset was written (default: None).
- Type
-
_event_time_starting_timestamp
A datetime after which the event time of every record in the dataset falls (default: None).
- Type
-
_event_time_ending_timestamp
A datetime before which the event time of every record in the dataset falls (default: None).
- Type
-
_feature_groups_to_be_merged
A list of FeatureGroupToBeMerged objects which will be joined to base (default: []).
- Type
List[FeatureGroupToBeMerged]
-
_event_time_identifier_feature_type
A FeatureTypeEnum representing the type of the event time identifier feature (default: None).
- Type
Method generated by attrs for class DatasetBuilder.
-
with_feature_group(feature_group, target_feature_name_in_base=None, included_feature_names=None)
Join FeatureGroup with base.
- Parameters
feature_group (FeatureGroup) – A FeatureGroup which will be joined to base.
target_feature_name_in_base (str) – A string representing the feature name in base which will be used as target join key (default: None).
included_feature_names (List[str]) – A list of strings representing features to be included in the output (default: None).
- Returns
This DatasetBuilder object.
-
point_in_time_accurate_join()
Set the join type to point-in-time accurate join.
- Returns
This DatasetBuilder object.
-
include_duplicated_records()
Include duplicated records in the dataset.
- Returns
This DatasetBuilder object.
-
include_deleted_records()
Include deleted records in the dataset.
- Returns
This DatasetBuilder object.
-
with_number_of_recent_records_by_record_identifier(number_of_recent_records)
Set the number_of_recent_records field with the provided input.
- Parameters
number_of_recent_records (int) – An int specifying how many recent records are returned for each record identifier.
- Returns
This DatasetBuilder object.
-
with_number_of_records_from_query_results(number_of_records)
Set the number_of_records field with the provided input.
- Parameters
number_of_records (int) – An int specifying how many records are returned.
- Returns
This DatasetBuilder object.
-
as_of(timestamp)
Set the write_time_ending_timestamp field with the provided input.
- Parameters
timestamp (datetime.datetime) – A datetime before which every record in the dataset was written.
- Returns
This DatasetBuilder object.
-
with_event_time_range(starting_timestamp=None, ending_timestamp=None)
Set event_time_starting_timestamp and event_time_ending_timestamp with the provided inputs.
- Parameters
starting_timestamp (datetime.datetime) – A datetime after which the event time of every record in the dataset falls (default: None).
ending_timestamp (datetime.datetime) – A datetime before which the event time of every record in the dataset falls (default: None).
- Returns
This DatasetBuilder object.
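The combined effect of as_of, with_event_time_range, and with_number_of_recent_records_by_record_identifier can be pictured with a small in-memory filter. This is a sketch of the described semantics only, not the query the SDK issues: `select_records` and the record layout are hypothetical, and treating the time boundaries as inclusive is an assumption, since the text does not say whether they are.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def select_records(records: List[Record],
                   as_of: Optional[datetime] = None,
                   event_time_start: Optional[datetime] = None,
                   event_time_end: Optional[datetime] = None,
                   number_of_recent_records: int = 1) -> List[Record]:
    """Filter by write time and event time, then keep the newest N per identifier."""
    kept = [
        r for r in records
        if (as_of is None or r["write_time"] <= as_of)
        and (event_time_start is None or r["event_time"] >= event_time_start)
        and (event_time_end is None or r["event_time"] <= event_time_end)
    ]
    # Newest event first, so the first N seen per identifier are the most recent.
    kept.sort(key=lambda r: r["event_time"], reverse=True)
    seen: Dict[Any, int] = {}
    result = []
    for r in kept:
        count = seen.get(r["id"], 0)
        if count < number_of_recent_records:
            result.append(r)
            seen[r["id"]] = count + 1
    return result


records = [
    {"id": "a", "event_time": datetime(2023, 1, 1), "write_time": datetime(2023, 1, 2), "value": 1},
    {"id": "a", "event_time": datetime(2023, 2, 1), "write_time": datetime(2023, 2, 2), "value": 2},
    {"id": "a", "event_time": datetime(2023, 3, 1), "write_time": datetime(2023, 3, 2), "value": 3},
    {"id": "b", "event_time": datetime(2023, 1, 15), "write_time": datetime(2023, 1, 16), "value": 10},
]

# Only records written on or before 15 February, newest one per identifier.
latest = select_records(records, as_of=datetime(2023, 2, 15))

# Event times from 10 January to 31 March, up to two records per identifier.
ranged = select_records(records,
                        event_time_start=datetime(2023, 1, 10),
                        event_time_end=datetime(2023, 3, 31),
                        number_of_recent_records=2)
print([r["value"] for r in latest], [r["value"] for r in ranged])
```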
-
to_csv_file()
Get the query string and the result in a .csv format file.