Let's motivate the need for a feature store by walking chronologically through the challenges developers face in their current workflows. Suppose we have a task where we need to predict something for an entity (e.g., a user) using their features.
Point-in-time correctness refers to mapping the appropriately up-to-date input feature values to an observed outcome at \(t_{n+1}\). This involves knowing the time (\(t_n\)) that a prediction is needed so we can collect feature values (\(X\)) at that time.
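As a minimal sketch of the idea, the snippet below looks up the most recent feature value recorded at or before the prediction time \(t_n\) for a single entity. The feature history, the `point_in_time_value` helper and all values are hypothetical and only meant for illustration; on DataFrames, `pandas.merge_asof` performs the same kind of backward as-of join.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one user: (timestamp, clicks_last_hour),
# sorted by timestamp. The values are made up for illustration.
feature_history = [
    (datetime(2021, 1, 1, 9, 0), 3),
    (datetime(2021, 1, 1, 10, 0), 7),
    (datetime(2021, 1, 1, 11, 0), 2),
]


def point_in_time_value(history, prediction_time):
    """Return the latest feature value recorded at or before prediction_time."""
    timestamps = [ts for ts, _ in history]
    # Number of rows whose timestamp is <= prediction_time (t_n).
    idx = bisect_right(timestamps, prediction_time)
    if idx == 0:
        return None  # no feature value existed yet at prediction time
    return history[idx - 1][1]


# A prediction requested at 10:30 must only see values known by 10:30.
value_at_tn = point_in_time_value(feature_history, datetime(2021, 1, 1, 10, 30))
print(value_at_tn)
```

Using the 11:00 value for a 10:30 prediction would leak information from the future into training, which is exactly what point-in-time correctness is meant to prevent.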
When actually constructing our feature store, there are several core components we need to have to address these challenges:
Each of these components is fairly easy to set up, but connecting them all together requires a managed service, an SDK layer for interactions, etc. Instead of building from scratch, it's best to leverage one of the production-ready feature store options such as Feast, Hopsworks, Tecton, Rasgo, etc. And of course, the large cloud providers have their own feature store options as well (Amazon's SageMaker Feature Store, Google's Vertex AI, etc.).
Tip
We highly recommend that you explore this lesson after completing the previous lessons since the topics (and code) are iteratively developed. We did, however, create the feature-store repository for a quick overview with an interactive notebook.
Not all machine learning platforms require a feature store. In fact, our use case is a perfect example of a task that does not benefit from a feature store. All of our data points are independent and stateless, they come from the client side, and there is no entity whose features change over time. The real utility of a feature store shines when we need up-to-date features for an entity that we continually generate predictions for. For example, a user's behavior (clicks, purchases, etc.) on an e-commerce platform or the deliveries a food runner made in the last hour.
To answer this question, let's revisit the main challenges that a feature store addresses:
Note
Additionally, if the transformations are compute intensive, they'll incur a lot of cost by running on duplicate datasets across different applications (as opposed to having a central location with up-to-date transformed features).
Skew: similar to duplication of efforts, if our transformations can be tied to the model or packaged as a standalone function, then we can just reuse the same pipelines to produce the feature values for training and serving. But this becomes complex and compute intensive as the number of applications, features and transformations grows.
Value: if we aren't working with features that need to be computed server-side (batch or streaming), then we don't have to worry about concepts like point-in-time correctness, etc. However, if we are, a feature store allows us to retrieve the appropriate feature values across all data sources without the developer having to worry about using disparate tools for different sources (batch, streaming, etc.).
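The skew point can be illustrated with a small sketch in which one transformation function is the single source of truth for both the training and serving paths. The function names (`preprocess`, `build_training_rows`, `build_serving_input`) and the transformation itself are hypothetical stand-ins, not part of this project's code.

```python
def preprocess(text: str) -> str:
    """Hypothetical transformation: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def build_training_rows(raw_rows):
    """Training path: apply the shared transformation to a batch of rows."""
    return [{"id": row["id"], "text": preprocess(row["text"])} for row in raw_rows]


def build_serving_input(raw_text: str) -> str:
    """Serving path: apply the exact same transformation to a single request."""
    return preprocess(raw_text)


training_rows = build_training_rows([{"id": 1, "text": "  Hello   World "}])
serving_input = build_serving_input("hello WORLD")
print(training_rows[0]["text"] == serving_input)
```

Because both paths call the same function, a change to the transformation cannot silently apply to training but not to serving. This works well for a single application, and it is the multiplication of such pipelines across applications that makes a central store attractive.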
We're going to leverage Feast as the feature store for our application for its ease of local setup, SDK for training/serving, etc.
👉 Follow along with the interactive notebook in the feature-store repository as we implement the concepts below.
We're going to create a feature repository at the root of our project. Feast will create a configuration file for us and we're going to add an additional features.py file to define our features.
Traditionally, the feature repository would be its own isolated repository that other services use to read features from and write features to.
The initialized feature repository (with the additional file we've added) will include:
We're going to configure the locations for our registry and online store (SQLite) in our feature_store.yaml file.
If all our feature definitions look valid, Feast will sync the metadata about Feast objects to the registry. The registry is a tiny database storing most of the same information you have in the feature repository. This step is necessary because the production feature serving infrastructure won't be able to access Python files in the feature repository at run time, but it will be able to efficiently and securely read the feature definitions from the registry.
When we run Feast locally, the offline store is effectively represented via Pandas point-in-time joins, whereas in production the offline store can be something more robust like Google BigQuery, Amazon Redshift, etc.
We'll go ahead and paste this into our features/feature_store.yaml file (the notebook cell does this automatically):
The first step is to establish connections with our data sources (databases, data warehouse, etc.). Feast requires its data sources to come from a file (Parquet), a data warehouse (BigQuery) or a data stream (Kafka / Kinesis). We'll convert our generated features file from the DataOps pipeline (features.json) into a Parquet file, a column-major data format that allows fast feature retrieval and caching (in contrast to row-major data formats such as CSV, where we have to traverse every single row to collect feature values).
import os
from pathlib import Path  # needed for the DATA_DIR path used below
import pandas as pd
# Load labeled projects
projects = pd.read_csv("https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/projects.csv")
tags = pd.read_csv("https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/tags.csv")
df = pd.merge(projects, tags, on="id")
df["text"] = df.title + " " + df.description
df.drop(["title", "description"], axis=1, inplace=True)
df.head(5)
# Format timestamp
df.created_on = pd.to_datetime(df.created_on)
# Convert to parquet
DATA_DIR = Path(os.getcwd(), "data")
df.to_parquet(
    Path(DATA_DIR, "features.parquet"),
    compression=None,
    allow_truncated_timestamps=True,
)
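To make the row-major versus column-major distinction concrete, here is a toy sketch in plain Python. The records and values are made up, and real Parquet files add encoding, compression and metadata that this sketch ignores.

```python
# Row-major layout (CSV-like): each record keeps all of its fields together.
row_major = [
    {"id": 1, "text": "intro to transformers", "tag": "nlp"},
    {"id": 2, "text": "image segmentation", "tag": "computer-vision"},
    {"id": 3, "text": "deploying models", "tag": "mlops"},
]

# Column-major layout (Parquet-like): each column is stored contiguously.
column_major = {
    "id": [1, 2, 3],
    "text": ["intro to transformers", "image segmentation", "deploying models"],
    "tag": ["nlp", "computer-vision", "mlops"],
}

# Collecting one feature from the row-major layout touches every record.
tags_from_rows = [row["tag"] for row in row_major]

# The column-major layout already holds the feature as one contiguous unit.
tags_from_columns = column_major["tag"]

print(tags_from_rows == tags_from_columns)
```

Both layouts hold the same data, but only the column-major one lets us read a single feature without scanning the other fields of every record.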
Now that we have our data source prepared, we can define our features for the feature store.
from datetime import datetime
from pathlib import Path
from feast import Entity, Feature, FeatureView, ValueType
from feast.data_source import FileSource
from google.protobuf.duration_pb2 import Duration
The first step is to define the location of the features (FileSource in our case) and the timestamp column for each data point.
# Read data
START_TIME = "2020-02-17"
project_details = FileSource(
    path=str(Path(DATA_DIR, "features.parquet")),
    event_timestamp_column="created_on",
)
Next, we need to define the main entity that each data point pertains to. In our case, each project has a unique ID with features such as text and tags.
# Define an entity
project = Entity(
    name="id",
    value_type=ValueType.INT64,
    description="project id",
)
Finally, we're ready to create a FeatureView that loads specific features (features), of various value types, from a data source (input) for a specific period of time (ttl).
# Define a Feature View for each project
project_details_view = FeatureView(
    name="project_details",
    entities=["id"],
    ttl=Duration(
        seconds=(datetime.today() - datetime.strptime(START_TIME, "%Y-%m-%d")).days * 24 * 60 * 60
    ),
    features=[
        Feature(name="text", dtype=ValueType.STRING),
        Feature(name="tag", dtype=ValueType.STRING),
    ],
    online=True,
    input=project_details,
    tags={},
)
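The `ttl` expression in the feature view is dense, so here is a standard-library sketch of the same arithmetic: the number of whole days elapsed since `START_TIME`, expressed in seconds. The `ttl_seconds` helper and the fixed example date are hypothetical and only there to make the result reproducible; the feature view itself uses `datetime.today()`.

```python
from datetime import datetime

START_TIME = "2020-02-17"


def ttl_seconds(today: datetime, start: str = START_TIME) -> int:
    """Whole days elapsed between `start` and `today`, expressed in seconds."""
    elapsed_days = (today - datetime.strptime(start, "%Y-%m-%d")).days
    return elapsed_days * 24 * 60 * 60


# Fixed example date (an assumption for illustration): ten and a half days
# after START_TIME. The partial day is dropped by `.days`.
example_ttl = ttl_seconds(datetime(2020, 2, 27, 15, 30))
print(example_ttl)
```

Setting the window to reach all the way back to `START_TIME` means that, as of today, no data point in the source is older than the `ttl`.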
So let's go ahead and define our feature views by moving this code into our features/features.py script (the notebook cell does this automatically):
Once we've defined our feature views, we can apply them to push a version-controlled definition of our features to the registry for fast access. This will also configure the registry and online store that we've defined in our feature_store.yaml.
Once we've registered our feature definition, along with the data source, entity definition, etc., we can use it to fetch historical features. This is done via point-in-time joins on the provided timestamps, using pandas for our local setup or an offline store such as BigQuery, Hive, etc. in production.
from datetime import datetime  # needed for the entity timestamps below
import pandas as pd
from feast import FeatureStore
# Identify entities
project_ids = df.id[0:3].to_list()
now = datetime.now()
timestamps = [datetime(now.year, now.month, now.day)]*len(project_ids)
entity_df = pd.DataFrame.from_dict({"id": project_ids, "event_timestamp": timestamps})
entity_df.head()
# Get historical features
store = FeatureStore(repo_path="features")
training_df = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=["project_details:text", "project_details:tag"],
).to_df()
training_df.head()
For online inference, we want to retrieve features very quickly via our online store, as opposed to fetching them from slow joins. However, the features are not in our online store just yet, so we'll need to materialize them first.
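Conceptually, materializing means copying the latest feature values per entity, up to some end date, from the offline source into a fast key-value store. The sketch below mimics that with an in-memory dictionary. It is not Feast's implementation, and the `materialize` helper and the rows are hypothetical.

```python
from datetime import datetime

# Hypothetical offline rows (several timestamps may exist per entity).
offline_rows = [
    {"id": 1, "created_on": datetime(2020, 2, 17), "tag": "nlp"},
    {"id": 1, "created_on": datetime(2020, 3, 1), "tag": "computer-vision"},
    {"id": 2, "created_on": datetime(2020, 2, 20), "tag": "mlops"},
]


def materialize(rows, end_date):
    """Keep only the latest feature values per entity as of end_date."""
    online_store = {}
    # Iterate in chronological order so later rows overwrite earlier ones.
    for row in sorted(rows, key=lambda r: r["created_on"]):
        if row["created_on"] <= end_date:
            online_store[row["id"]] = {"tag": row["tag"]}
    return online_store


online_store = materialize(offline_rows, end_date=datetime(2020, 2, 25))
print(online_store[1])  # a single key lookup instead of a join
```

Serving a prediction then only needs one lookup by entity ID, which is why online retrieval is so much faster than the historical joins.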