Feature Engineering: The Craft That Makes ML Models Actually Work

“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.” — Andrew Ng

This quote is from 2012. In 2026, it’s still mostly true.

Yes, deep learning automates a lot of feature engineering. Yes, foundation models have changed the calculus for many problems. But for tabular data, time series, structured domains, and any problem where data is limited — feature engineering remains the highest-leverage skill in applied ML.

This guide covers the craft: what it is, why it matters, and how to actually do it well.

What is a feature?

A feature is any input variable used by a model to make predictions. If you’re predicting customer churn:

account_age_days is a feature
num_support_tickets_last_30d is a feature
days_since_last_login is a feature

Feature engineering is the process of using domain knowledge to create, transform, and select features that help the model learn the patterns you care about.

Raw data rarely comes in a form that’s useful for models. Feature engineering is the translation layer between raw data and model-ready inputs.

Why features matter more than algorithms

Consider this: a linear regression model with brilliant features will frequently outperform a gradient-boosted ensemble with poor features on the same problem.

Why? Because algorithms are optimization machines. They find the best possible patterns in the data you give them. If the data doesn’t encode the information needed to make good predictions, no algorithm can conjure it from nothing.

A concrete example: suppose you’re predicting whether a credit card transaction is fraudulent. Raw data includes:

Transaction amount: $127.50
Merchant: AMZN
Timestamp: 2026-03-05 03:22:14

The timestamp alone is almost useless to a linear model. But engineer it into:

hour_of_day: 3 (3am transactions are more suspicious)
is_weekend: False
days_since_last_transaction: 0.12 (made another purchase 3 hours ago)
amount_vs_30d_average_ratio: 3.2 (3.2x larger than user’s typical purchase)
is_new_merchant: True (user has never bought from this merchant)

Now the model has access to patterns a domain expert would recognize. This is feature engineering.

Core techniques

1. Interaction features

Capture relationships between variables:

# These individual features might be weak predictors
df['price'] = ...
df['discount_pct'] = ...

# Their interaction might be a strong predictor
df['effective_price'] = df['price'] * (1 - df['discount_pct'])
df['price_x_discount'] = df['price'] * df['discount_pct']

When two features interact in your domain (price and discount clearly do), multiply, divide, or combine them explicitly. Tree models can learn interactions, but explicit interaction features help linear models and reduce tree depth requirements.

2. Temporal features

Raw timestamps are nearly useless. Decompose them:

from datetime import datetime

# Cyclical encoding preserves periodicity
import numpy as np

df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['day_of_week_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)

# Domain-specific time flags
df['is_holiday'] = df['date'].map(holiday_lookup)
df['days_until_event'] = (df['next_event_date'] - df['date']).dt.days
df['is_month_end'] = df['date'].dt.is_month_end

Cyclical encoding (sin/cos) is important for periodic features like hour-of-day and day-of-week. Without it, the model doesn’t know that hour 23 and hour 0 are adjacent.

3. Aggregation features (window features)

Capture historical behavior:

# Rolling aggregations
df['spend_7d'] = df.groupby('user_id')['amount'].transform(
    lambda x: x.rolling('7D').sum()
)
df['avg_spend_30d'] = df.groupby('user_id')['amount'].transform(
    lambda x: x.rolling('30D').mean()
)
df['txn_count_7d'] = df.groupby('user_id')['amount'].transform(
    lambda x: x.rolling('7D').count()
)

These are often the most powerful features in behavioral models. A user’s history is almost always more predictive than their current action in isolation.

Watch out for data leakage: If you’re predicting the future, your window features must only use data from the past. This sounds obvious but is easy to get wrong with pandas.

4. Target encoding

For high-cardinality categorical features (e.g., 10,000 product SKUs), one-hot encoding is impractical. Target encoding replaces each category with the mean target value for that category:

# Mean target per category
target_mean = df.groupby('product_id')['purchased'].mean()
df['product_target_encoded'] = df['product_id'].map(target_mean)

Important: Always use cross-validation folds to compute target encoding. Computing it on the full dataset causes leakage — the model implicitly sees the test labels.

5. Binning continuous features

Sometimes the relationship between a continuous variable and the target isn’t linear. Binning captures nonlinearity explicitly:

# Age groups might be more predictive than raw age
df['age_bin'] = pd.cut(df['age'], 
    bins=[0, 18, 25, 35, 50, 65, 100], 
    labels=['<18', '18-25', '25-35', '35-50', '50-65', '65+']
)

This is especially useful when you have domain knowledge about threshold effects. Medical risk increases sharply above certain BMI or blood pressure levels; binning around those thresholds can help.

6. Text features (when you have text columns)

For structured problems that include text fields:

from sklearn.feature_extraction.text import TfidfVectorizer

# Basic TF-IDF
tfidf = TfidfVectorizer(max_features=500, ngram_range=(1, 2))
text_features = tfidf.fit_transform(df['product_description'])

# Or use embedding models for richer representation
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df['review_text'].tolist())

For most problems involving short text fields, TF-IDF remains effective and interpretable. For longer text or semantic similarity tasks, sentence transformers are better.

Feature selection: less is often more

After engineering features, you’ll likely have too many. Why this matters:

Curse of dimensionality: Models trained on many low-signal features can overfit
Training time: More features = slower training
Interpretability: Fewer features = clearer understanding of what’s driving predictions
Production complexity: Every feature is a data pipeline you have to maintain

Selection methods

Correlation analysis: Remove features that are highly correlated with each other (keep the more interpretable one):

corr_matrix = df.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]

Feature importance from trees:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

Permutation importance: More reliable than built-in tree importance, especially for correlated features:

from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_val, y_val, n_repeats=10)

SHAP values: The gold standard for understanding feature contribution:

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)
shap.summary_plot(shap_values, X_val)

Feature stores: when engineering at scale

If you’re running ML in production and reusing features across multiple models, consider a feature store — a centralized repository for computed, versioned features.

Key benefits:

Consistency: Training and serving use exactly the same feature computation
Reuse: Engineering user_purchase_velocity_7d once and using it in 12 models
Point-in-time correctness: Automatically handles temporal joins to prevent leakage
Monitoring: Track feature drift in production

Popular feature stores: Feast (open source), Tecton, Hopsworks, AWS SageMaker Feature Store.

For most teams, a feature store becomes necessary when you have 5+ models sharing features in production. Before that, it’s probably over-engineering.

Automated feature engineering

Tools like Featuretools (open source) can automatically generate interaction features:

import featuretools as ft

# Define entity set
es = ft.EntitySet(id='transactions')
es.add_dataframe(dataframe_name='transactions', 
                 dataframe=df,
                 index='transaction_id',
                 time_index='timestamp')

# Deep feature synthesis
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='transactions',
    trans_primitives=['add_numeric', 'multiply_numeric'],
    agg_primitives=['mean', 'sum', 'count'],
    max_depth=2
)

AutoML platforms (AutoSklearn, FLAML, H2O AutoML) also incorporate automated feature engineering into their pipelines.

Honest assessment: Automated feature engineering can generate useful features and is worth trying, but human domain knowledge almost always outperforms it. The best approach is using automated tools to explore and then applying domain judgment to select and refine.

The domain knowledge multiplier

The best feature engineers are the ones who deeply understand the domain, not just the tools.

A feature like days_since_last_purchase_in_same_category requires knowing that product categories matter to purchase behavior. A feature like message_sent_at_4am_flag requires knowing that 4am communications are a fraud signal. These insights don’t come from data — they come from understanding the problem.

Before you write a line of feature engineering code, spend time with the domain experts. Read the literature. Understand the failure modes. Ask: what do humans look at when making this decision manually?

The answer to that question is often your most predictive feature.

Summary

Feature engineering is the translation layer between raw data and model-ready inputs
Good features beat good algorithms — almost always
Core techniques: interaction features, temporal decomposition, rolling aggregations, target encoding, binning, text features
After engineering, use correlation analysis, tree importance, and SHAP values to select the best features
Feature stores become valuable at production scale (5+ models sharing features)
Automated tools can help explore, but domain knowledge is the real multiplier
Ask what humans look at when making this decision — then encode that as a feature