Feature Engineering: The Craft That Makes ML Models Actually Work
Better features beat better algorithms almost every time. A deep dive into feature engineering — the underrated craft at the heart of practical machine learning.
View all machine learning depths →Depth ladder for this topic:
“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.” — Andrew Ng
This quote is from 2012. In 2026, it’s still mostly true.
Yes, deep learning automates a lot of feature engineering. Yes, foundation models have changed the calculus for many problems. But for tabular data, time series, structured domains, and any problem where data is limited — feature engineering remains the highest-leverage skill in applied ML.
This guide covers the craft: what it is, why it matters, and how to actually do it well.
What is a feature?
A feature is any input variable used by a model to make predictions. If you’re predicting customer churn:
account_age_daysis a featurenum_support_tickets_last_30dis a featuredays_since_last_loginis a feature
Feature engineering is the process of using domain knowledge to create, transform, and select features that help the model learn the patterns you care about.
Raw data rarely comes in a form that’s useful for models. Feature engineering is the translation layer between raw data and model-ready inputs.
Why features matter more than algorithms
Consider this: a linear regression model with brilliant features will frequently outperform a gradient-boosted ensemble with poor features on the same problem.
Why? Because algorithms are optimization machines. They find the best possible patterns in the data you give them. If the data doesn’t encode the information needed to make good predictions, no algorithm can conjure it from nothing.
A concrete example: suppose you’re predicting whether a credit card transaction is fraudulent. Raw data includes:
- Transaction amount:
$127.50 - Merchant:
AMZN - Timestamp:
2026-03-05 03:22:14
The timestamp alone is almost useless to a linear model. But engineer it into:
hour_of_day: 3(3am transactions are more suspicious)is_weekend: Falsedays_since_last_transaction: 0.12(made another purchase 3 hours ago)amount_vs_30d_average_ratio: 3.2(3.2x larger than user’s typical purchase)is_new_merchant: True(user has never bought from this merchant)
Now the model has access to patterns a domain expert would recognize. This is feature engineering.
Core techniques
1. Interaction features
Capture relationships between variables:
# These individual features might be weak predictors
df['price'] = ...
df['discount_pct'] = ...
# Their interaction might be a strong predictor
df['effective_price'] = df['price'] * (1 - df['discount_pct'])
df['price_x_discount'] = df['price'] * df['discount_pct']
When two features interact in your domain (price and discount clearly do), multiply, divide, or combine them explicitly. Tree models can learn interactions, but explicit interaction features help linear models and reduce tree depth requirements.
2. Temporal features
Raw timestamps are nearly useless. Decompose them:
from datetime import datetime
# Cyclical encoding preserves periodicity
import numpy as np
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['day_of_week_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
# Domain-specific time flags
df['is_holiday'] = df['date'].map(holiday_lookup)
df['days_until_event'] = (df['next_event_date'] - df['date']).dt.days
df['is_month_end'] = df['date'].dt.is_month_end
Cyclical encoding (sin/cos) is important for periodic features like hour-of-day and day-of-week. Without it, the model doesn’t know that hour 23 and hour 0 are adjacent.
3. Aggregation features (window features)
Capture historical behavior:
# Rolling aggregations
df['spend_7d'] = df.groupby('user_id')['amount'].transform(
lambda x: x.rolling('7D').sum()
)
df['avg_spend_30d'] = df.groupby('user_id')['amount'].transform(
lambda x: x.rolling('30D').mean()
)
df['txn_count_7d'] = df.groupby('user_id')['amount'].transform(
lambda x: x.rolling('7D').count()
)
These are often the most powerful features in behavioral models. A user’s history is almost always more predictive than their current action in isolation.
Watch out for data leakage: If you’re predicting the future, your window features must only use data from the past. This sounds obvious but is easy to get wrong with pandas.
4. Target encoding
For high-cardinality categorical features (e.g., 10,000 product SKUs), one-hot encoding is impractical. Target encoding replaces each category with the mean target value for that category:
# Mean target per category
target_mean = df.groupby('product_id')['purchased'].mean()
df['product_target_encoded'] = df['product_id'].map(target_mean)
Important: Always use cross-validation folds to compute target encoding. Computing it on the full dataset causes leakage — the model implicitly sees the test labels.
5. Binning continuous features
Sometimes the relationship between a continuous variable and the target isn’t linear. Binning captures nonlinearity explicitly:
# Age groups might be more predictive than raw age
df['age_bin'] = pd.cut(df['age'],
bins=[0, 18, 25, 35, 50, 65, 100],
labels=['<18', '18-25', '25-35', '35-50', '50-65', '65+']
)
This is especially useful when you have domain knowledge about threshold effects. Medical risk increases sharply above certain BMI or blood pressure levels; binning around those thresholds can help.
6. Text features (when you have text columns)
For structured problems that include text fields:
from sklearn.feature_extraction.text import TfidfVectorizer
# Basic TF-IDF
tfidf = TfidfVectorizer(max_features=500, ngram_range=(1, 2))
text_features = tfidf.fit_transform(df['product_description'])
# Or use embedding models for richer representation
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df['review_text'].tolist())
For most problems involving short text fields, TF-IDF remains effective and interpretable. For longer text or semantic similarity tasks, sentence transformers are better.
Feature selection: less is often more
After engineering features, you’ll likely have too many. Why this matters:
- Curse of dimensionality: Models trained on many low-signal features can overfit
- Training time: More features = slower training
- Interpretability: Fewer features = clearer understanding of what’s driving predictions
- Production complexity: Every feature is a data pipeline you have to maintain
Selection methods
Correlation analysis: Remove features that are highly correlated with each other (keep the more interpretable one):
corr_matrix = df.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
Feature importance from trees:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
importance_df = pd.DataFrame({
'feature': X_train.columns,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
Permutation importance: More reliable than built-in tree importance, especially for correlated features:
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_val, y_val, n_repeats=10)
SHAP values: The gold standard for understanding feature contribution:
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)
shap.summary_plot(shap_values, X_val)
Feature stores: when engineering at scale
If you’re running ML in production and reusing features across multiple models, consider a feature store — a centralized repository for computed, versioned features.
Key benefits:
- Consistency: Training and serving use exactly the same feature computation
- Reuse: Engineering
user_purchase_velocity_7donce and using it in 12 models - Point-in-time correctness: Automatically handles temporal joins to prevent leakage
- Monitoring: Track feature drift in production
Popular feature stores: Feast (open source), Tecton, Hopsworks, AWS SageMaker Feature Store.
For most teams, a feature store becomes necessary when you have 5+ models sharing features in production. Before that, it’s probably over-engineering.
Automated feature engineering
Tools like Featuretools (open source) can automatically generate interaction features:
import featuretools as ft
# Define entity set
es = ft.EntitySet(id='transactions')
es.add_dataframe(dataframe_name='transactions',
dataframe=df,
index='transaction_id',
time_index='timestamp')
# Deep feature synthesis
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name='transactions',
trans_primitives=['add_numeric', 'multiply_numeric'],
agg_primitives=['mean', 'sum', 'count'],
max_depth=2
)
AutoML platforms (AutoSklearn, FLAML, H2O AutoML) also incorporate automated feature engineering into their pipelines.
Honest assessment: Automated feature engineering can generate useful features and is worth trying, but human domain knowledge almost always outperforms it. The best approach is using automated tools to explore and then applying domain judgment to select and refine.
The domain knowledge multiplier
The best feature engineers are the ones who deeply understand the domain, not just the tools.
A feature like days_since_last_purchase_in_same_category requires knowing that product categories matter to purchase behavior. A feature like message_sent_at_4am_flag requires knowing that 4am communications are a fraud signal. These insights don’t come from data — they come from understanding the problem.
Before you write a line of feature engineering code, spend time with the domain experts. Read the literature. Understand the failure modes. Ask: what do humans look at when making this decision manually?
The answer to that question is often your most predictive feature.
Summary
- Feature engineering is the translation layer between raw data and model-ready inputs
- Good features beat good algorithms — almost always
- Core techniques: interaction features, temporal decomposition, rolling aggregations, target encoding, binning, text features
- After engineering, use correlation analysis, tree importance, and SHAP values to select the best features
- Feature stores become valuable at production scale (5+ models sharing features)
- Automated tools can help explore, but domain knowledge is the real multiplier
- Ask what humans look at when making this decision — then encode that as a feature
Simplify
← Machine Learning for Builders — Architecture, Trade-offs, and Deployment
Go deeper
The Bias-Variance Tradeoff: Why ML Models Fail in Two Opposite Ways →
Related reads
Stay ahead of the AI curve
Weekly insights on AI — explained at the level that's right for you. No hype, no jargon, just what matters.
No spam. Unsubscribe anytime. We respect your inbox.