Time Series Forecasting with Machine Learning: A Practical Guide
Time series forecasting has been transformed by ML approaches. This guide covers when to use ML over statistical methods, which architectures work best, and the practical pitfalls that catch most teams.
Time series forecasting is one of those domains where the gap between "works in a notebook" and "works in production" is enormous. Statistical methods like ARIMA have decades of reliability behind them. ML methods promise better accuracy but come with complexity. Here's how to navigate the tradeoffs.
When ML Beats Statistical Methods
Not always. That's the honest answer. For simple, univariate series with strong seasonal patterns and stable trends, ARIMA or exponential smoothing often wins. ML shines when:
- Multiple input features influence the forecast (price, weather, promotions, holidays)
- Complex nonlinear patterns exist that decomposition can't capture
- Cross-series learning is possible (forecasting 10,000 SKUs, where patterns transfer)
- Long horizons are needed relative to the data frequency
- Irregular or missing data is common
The ML Forecasting Toolkit
Gradient Boosted Trees (XGBoost, LightGBM)
Still the workhorse for tabular time series. The trick is feature engineering:
import pandas as pd

def create_time_features(df, target_col, lags=(1, 7, 14, 28)):
    """Build lag, rolling, and calendar features. df must have a DatetimeIndex."""
    features = pd.DataFrame(index=df.index)
    # Lag features: the target's value `lag` steps earlier
    for lag in lags:
        features[f'lag_{lag}'] = df[target_col].shift(lag)
    # Rolling statistics, shifted by one step so the window excludes the current value
    for window in [7, 14, 28]:
        features[f'rolling_mean_{window}'] = (
            df[target_col].shift(1).rolling(window).mean()
        )
        features[f'rolling_std_{window}'] = (
            df[target_col].shift(1).rolling(window).std()
        )
    # Calendar features: known in advance, so no shift is needed
    features['day_of_week'] = df.index.dayofweek
    features['month'] = df.index.month
    features['is_weekend'] = (df.index.dayofweek >= 5).astype(int)
    return features
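Key rule: never use future information in features. Every feature derived from the target must use .shift(1) or greater to avoid data leakage; only inputs that are genuinely known in advance, such as calendar fields, are exempt. This is the single most common mistake in time series ML.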
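To make the leakage concrete, here is a minimal sketch comparing a rolling mean with and without the shift. The toy series and the variable names (`y`, `leaky`, `safe`) are invented for illustration; only `shift` and `rolling` come from pandas.

```python
import pandas as pd

# Toy daily series; the values are illustrative only.
y = pd.Series(
    [10.0, 12.0, 11.0, 15.0, 14.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Leaky: the 2-day window ending at day t includes y[t], the value being predicted.
leaky = y.rolling(2).mean()

# Safe: shift first, so the window for day t covers y[t-2] and y[t-1] only.
safe = y.shift(1).rolling(2).mean()

comparison = pd.DataFrame({"y": y, "leaky": leaky, "safe": safe})
print(comparison)
```

In the printed table, each value in the `safe` column equals the `leaky` value from the previous day: the shifted feature only ever sees information that was available before the forecast was made.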
Foundation Models (TimesFM, Chronos, Moirai)
The newest entrant. Pre-trained on millions of time series, these models can forecast zero-shot, with no training on your specific data. They're surprisingly good for:
- Cold-start problems (new product, no history)
- Quick baselines
- Series where you have too little data for custom training
Current limitations: they struggle with strong exogenous features and very domain-specific patterns.
Deep Learning (N-BEATS, TFT, PatchTST)
Temporal Fusion Transformers (TFT) remain the strongest deep learning option for complex forecasting with multiple features. They handle:
- Static covariates (store location, product category)
- Known future inputs (holidays, planned promotions)
- Unknown future inputs, observed only up to the forecast date (actual weather, economic indicators)
- Multi-horizon outputs with quantile predictions
The downside: they need significant data (thousands of series or long history) and careful tuning.
The Evaluation Trap
Time series evaluation requires temporal cross-validation, not random splitting:
Training    | Validation | Test
------------+------------+----------
[Jan-Jun]   | [Jul-Aug]  | [Sep-Oct]
[Jan-Aug]   | [Sep-Oct]  | [Nov-Dec]
Use expanding or sliding windows. Random k-fold will massively overestimate performance because it leaks temporal information.
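An expanding-window split like the one in the diagram can be sketched in a few lines of plain Python. The function name `expanding_window_splits` and its parameters are hypothetical, chosen for this example; the month labels simply mirror the diagram.

```python
def expanding_window_splits(n_samples, min_train, val_size, test_size):
    """Return (train, val, test) index ranges in time order.

    Each fold extends the training window by val_size steps, so validation
    and test data always come after everything the model was trained on.
    """
    splits = []
    train_end = min_train
    while train_end + val_size + test_size <= n_samples:
        val_end = train_end + val_size
        test_end = val_end + test_size
        splits.append(
            (range(0, train_end), range(train_end, val_end), range(val_end, test_end))
        )
        train_end += val_size
    return splits


months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
splits = expanding_window_splits(len(months), min_train=6, val_size=2, test_size=2)

for train, val, test in splits:
    print(
        f"train {months[train[0]]}-{months[train[-1]]} | "
        f"val {months[val[0]]}-{months[val[-1]]} | "
        f"test {months[test[0]]}-{months[test[-1]]}"
    )
```

For a sliding window, the training range would start at `train_end - min_train` instead of 0. If you only need train/test folds, scikit-learn's `TimeSeriesSplit` provides the same ordering guarantee.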
Metrics that matter:
- MASE (Mean Absolute Scaled Error): Compares against naive baseline, scale-independent
- WAPE (Weighted Absolute Percentage Error): Business-friendly, handles zeros
- Quantile losses: If you need prediction intervals (you probably do)
Avoid MAPE for series that cross zero or have small values: it explodes.
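For reference, the three recommended metrics can be written in plain Python as follows. This is a sketch using the common textbook definitions; the function names and the toy numbers are illustrative, and `season=1` assumes the naive baseline is "repeat the previous value".

```python
def mase(actual, forecast, train, season=1):
    """MAE of the forecast divided by the in-sample MAE of a (seasonal) naive forecast."""
    naive_errors = [abs(train[i] - train[i - season]) for i in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


def wape(actual, forecast):
    """Total absolute error divided by total absolute actuals."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(abs(a) for a in actual)


def pinball_loss(actual, forecast_q, q):
    """Average quantile (pinball) loss for a forecast of quantile level q in (0, 1)."""
    total = 0.0
    for a, f in zip(actual, forecast_q):
        diff = a - f
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(actual)


# Toy data: note the zero in `actual`, where MAPE would divide by zero.
train = [10.0, 12.0, 11.0, 13.0, 12.0]
actual = [0.0, 10.0, 20.0]
forecast = [1.0, 9.0, 22.0]

print("MASE:", mase(actual, forecast, train))
print("WAPE:", wape(actual, forecast))
print("Pinball (q=0.9):", pinball_loss(actual, forecast, q=0.9))
```

WAPE stays well defined here because it divides by the sum of the actuals rather than by each actual individually. A MASE below 1 means the forecast beats the naive baseline on average.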
Production Pitfalls
Concept Drift
Time series patterns change. COVID broke every demand model in existence. Build monitoring that detects when your forecast error distribution shifts significantly from validation performance.
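One simple way to start is to compare a rolling mean absolute error against the MAE measured during validation. The sketch below is a deliberately crude stand-in for a full distributional test; the class name `ErrorDriftMonitor`, the window length, and the 1.5x ratio threshold are all assumptions to tune for your own data.

```python
from collections import deque


class ErrorDriftMonitor:
    """Flags drift when recent MAE exceeds a multiple of the validation MAE."""

    def __init__(self, validation_mae, window=28, ratio_threshold=1.5):
        self.validation_mae = validation_mae
        self.ratio_threshold = ratio_threshold
        self.errors = deque(maxlen=window)  # keeps only the most recent errors

    def update(self, actual, forecast):
        """Record one absolute error and return True if drift is flagged."""
        self.errors.append(abs(actual - forecast))
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough recent observations yet
        recent_mae = sum(self.errors) / len(self.errors)
        return recent_mae > self.ratio_threshold * self.validation_mae


# Toy stream of (actual, forecast) pairs; the last one is a large miss.
monitor = ErrorDriftMonitor(validation_mae=2.0, window=3, ratio_threshold=1.5)
stream = [(100, 99), (100, 102), (100, 101), (100, 110)]
flags = [monitor.update(actual, forecast) for actual, forecast in stream]
print(flags)
```

A short window reacts quickly but is easily triggered by a single outlier, as in this toy stream; a longer window is steadier but slower to notice a real shift.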
Retraining Cadence
How often should you retrain? The answer depends on how fast your data's patterns change. Options:
- Fixed schedule: Retrain weekly/monthly regardless
- Triggered: Retrain when error exceeds a threshold
- Online learning: Update incrementally with each new observation
Most teams start with fixed schedule and add triggered retraining as they mature.
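Combining the first two options is mostly a matter of checking two conditions. The sketch below assumes a hypothetical `should_retrain` helper, a weekly schedule, and an error threshold that you would set from validation performance.

```python
from datetime import date, timedelta


def should_retrain(today, last_trained, recent_error, error_threshold,
                   max_age=timedelta(days=7)):
    """Return "triggered", "scheduled", or None (no retraining needed)."""
    if recent_error > error_threshold:
        return "triggered"  # error-based trigger takes priority
    if today - last_trained >= max_age:
        return "scheduled"  # fixed schedule as the fallback
    return None


print(should_retrain(date(2024, 1, 10), date(2024, 1, 8), recent_error=5.0, error_threshold=3.0))
print(should_retrain(date(2024, 1, 15), date(2024, 1, 8), recent_error=1.0, error_threshold=3.0))
print(should_retrain(date(2024, 1, 10), date(2024, 1, 8), recent_error=1.0, error_threshold=3.0))
```

Returning the reason rather than a bare boolean makes it easy to log how often each path fires, which helps when deciding whether the schedule or the threshold needs adjusting.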
The Aggregation Question
Forecasting daily vs. weekly vs. monthly isn't just about granularity; it fundamentally changes difficulty. Daily forecasts are noisier. Monthly forecasts lose signal. The right level depends on the decision being made. If you're planning staffing, daily matters. If you're ordering inventory, weekly might suffice.
Ensembles Win
In production, ensemble a simple statistical model with your ML model. The statistical model provides stability and interpretability; the ML model captures complex patterns. A weighted average often beats either alone.
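A weighted average with the weight chosen on validation data might look like the sketch below. The helper names and the toy forecasts are made up for illustration; the numbers are contrived so that the statistical model under-forecasts and the ML model over-forecasts, which is the situation where blending helps most.

```python
def mae(actual, forecast):
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def blend(stat_forecast, ml_forecast, weight):
    """Weighted average; `weight` is the share given to the ML forecast."""
    return [(1 - weight) * s + weight * m for s, m in zip(stat_forecast, ml_forecast)]


def pick_weight(actual, stat_forecast, ml_forecast, steps=10):
    """Grid-search the ML weight in [0, 1] that minimizes validation MAE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: mae(actual, blend(stat_forecast, ml_forecast, w)))


# Toy validation-period data (illustrative values only).
val_actual = [100.0, 110.0, 120.0, 130.0]
val_stat = [97.0, 108.0, 115.0, 127.0]   # statistical model, runs low
val_ml = [105.0, 113.0, 126.0, 132.0]    # ML model, runs high

best_weight = pick_weight(val_actual, val_stat, val_ml)
blended = blend(val_stat, val_ml, best_weight)

print("weight on ML model:", best_weight)
print("MAE stat / ML / blend:",
      mae(val_actual, val_stat), mae(val_actual, val_ml), mae(val_actual, blended))
```

Pick the weight on validation data and then freeze it before touching the test set; tuning it on test data would overstate the ensemble's benefit.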
Getting Started
- Start with a naive baseline (last value, seasonal naive)
- Try a statistical method (ETS or ARIMA via statsforecast)
- Add gradient boosted trees with time features
- Consider deep learning only if you have enough data and complexity
- Ensemble the best performers
- Monitor in production: your model will degrade
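The two naive baselines from the first step of the list above take only a few lines, which is part of why they make a good starting point. This is a minimal sketch; the function names and the toy history are illustrative.

```python
def naive_forecast(history, horizon):
    """Repeat the last observed value for every future step."""
    return [history[-1]] * horizon


def seasonal_naive_forecast(history, horizon, season_length):
    """Repeat the last full season (e.g. season_length=7 for daily data with weekly seasonality)."""
    if len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


# Toy series with a repeating 3-step pattern.
history = [5, 7, 9, 6, 8, 10]
last_value_forecast = naive_forecast(history, horizon=4)
seasonal_forecast = seasonal_naive_forecast(history, horizon=4, season_length=3)

print(last_value_forecast)
print(seasonal_forecast)
```

Any more complex model should beat these on your temporal validation folds before it earns a place in the pipeline.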
The models that win Kaggle competitions are rarely the models that survive in production. Optimize for reliability and maintainability, not just accuracy.