Datetime Feature Engineering: Extraction, Encoding, and Cyclical Features
Definition
Datetime feature engineering transforms temporal data into numerical representations that machine learning models can process. Raw timestamps contain rich information—year, month, day, hour, seasonality, elapsed time—but require extraction and encoding to be useful. Effective datetime engineering captures temporal patterns, trends, and cyclical behaviors that are invisible in raw timestamps. Feature extraction decomposes timestamps into components (year, month, day-of-week, hour). Cyclical encoding represents periodic patterns (time of day, day of week, season) using sine/cosine transformations, preserving the circular nature of time (e.g., hour 23 is close to hour 0). Time since events captures duration and recency effects. Seasonality encoding represents repeating patterns at different granularities (daily, weekly, yearly). Lag features capture temporal dependencies. Rolling windows aggregate historical data. Proper datetime engineering is essential for time series forecasting, churn prediction, fraud detection, demand planning, and any problem where temporal patterns influence outcomes.
Intuition
Think of datetime features as the context behind a photo. The raw timestamp is like saying 'this was taken at 1703534400', which is meaningless to most people. Extracting components is like noting 'December 25, 2023, 8 PM (UTC)': suddenly we know it's Christmas evening. Cyclical encoding captures that 11 PM is close to midnight, which is close to 1 AM. A linear encoding (23, 0, 1) puts 23 and 1 at a distance of 22, whereas sine/cosine encoding places them next to each other on the circle. Time since last purchase captures customer engagement decay. Day-of-week features capture 'weekend shopping' patterns. Lag features ask 'what happened yesterday?', which is critical for forecasting. Just as a photo's context matters, the temporal context of data points often contains the most predictive signal.
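As a small illustration of the "raw number versus readable context" idea, the sketch below decodes a Unix timestamp with pandas and pulls out a few components. The value `1703534400` and the choice of UTC are assumptions made for the example.

```python
import pandas as pd

# Hypothetical example value: seconds since the Unix epoch, interpreted as UTC
raw = 1703534400

ts = pd.Timestamp(raw, unit="s", tz="UTC")

# Decompose the timestamp into human-readable components
components = {
    "year": ts.year,
    "month": ts.month,
    "day": ts.day,
    "hour": ts.hour,
    "dayofweek": ts.dayofweek,  # Monday=0
}

print(ts)
print(components)
```

The same integer would decode to a different local hour in another timezone, which is why the timezone is stated explicitly here.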
Mathematical Formula
Step-by-Step Explanation:
- Unix Timestamp: Seconds since epoch, useful for calculating durations and sorting
- Cyclical Encoding: Map cyclical features (hour, day) to unit circle using sine and cosine
- Time Since: Calculate elapsed time between events (recency, age, tenure)
- Week of Year: ISO week numbering for seasonal analysis
- Business Days: Count working days between dates, excluding weekends and holidays
- Lag Features: Values from previous time periods for time series modeling
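The steps above can be written compactly as formulas. The notation below is an assumption introduced for illustration (x is a periodic feature with period P, y_t is the series value at time t, and timestamps are measured in seconds); it is a sketch of the usual definitions, not notation taken from the original text.

```latex
% Cyclical encoding of a periodic feature x with period P (e.g. hour, P = 24)
x_{\sin} = \sin\!\left(\frac{2\pi x}{P}\right), \qquad
x_{\cos} = \cos\!\left(\frac{2\pi x}{P}\right)

% Euclidean distance between two encoded values x_1, x_2 (chord length on the unit circle)
d(x_1, x_2) = 2\,\left|\sin\!\left(\frac{\pi\,(x_1 - x_2)}{P}\right)\right|

% Time since an event, in days, for Unix timestamps in seconds
\Delta t_{\text{days}} = \frac{t_{\text{ref}} - t_{\text{event}}}{86400}

% Lag-k feature and rolling mean over a window of w observations
\operatorname{lag}_k(t) = y_{t-k}, \qquad
\bar{y}^{(w)}_t = \frac{1}{w} \sum_{i=0}^{w-1} y_{t-i}
```

For hours, the distance formula gives d(23, 0) = 2 sin(π/24) ≈ 0.26, while d(12, 0) = 2, the largest possible separation on the circle.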
Real-World Use Cases
Patient readmission prediction uses days since last visit (recency), season (e.g., flu season), and day of week (weekend admissions may carry higher risk). Time since medication start supports adherence analysis.
Credit scoring uses account age, days since last transaction (activity), and month-end patterns (paycheck timing). Cyclical encoding suits trading hours, and time between transactions helps with fraud detection.
Demand forecasting uses day of week (weekend spikes), month (seasonality), and holidays. Customer recency (days since last purchase) feeds churn prediction, and time since first purchase measures customer lifetime.
Predictive maintenance uses equipment age, operating hours, and time since last maintenance, along with seasonal patterns (temperature effects) and shift encoding (day/night production differences).
User engagement modeling uses session time of day (productivity hours), day of week (weekend usage), and account age. Time since last login signals churn risk, and cyclical patterns help model daily active users.
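Several of these use cases rely on recency and tenure per entity. A minimal sketch of how such features might be computed with pandas follows; the purchase log, the column names (`customer_id`, `purchased_at`), and the reference date are all hypothetical.

```python
import pandas as pd

# Hypothetical purchase log; column names are illustrative
purchases = pd.DataFrame({
    "customer_id": ["a", "a", "b", "a", "b"],
    "purchased_at": pd.to_datetime([
        "2023-01-05", "2023-01-20", "2023-02-01", "2023-03-01", "2023-02-15",
    ]),
}).sort_values(["customer_id", "purchased_at"])

# Gap to the same customer's previous purchase (NaN for a first purchase)
purchases["days_since_prev"] = (
    purchases.groupby("customer_id")["purchased_at"].diff().dt.days
)

# Recency and tenure measured against a fixed reference date
reference_date = pd.Timestamp("2023-04-01")
per_customer = purchases.groupby("customer_id")["purchased_at"].agg(["min", "max"])
per_customer["recency_days"] = (reference_date - per_customer["max"]).dt.days
per_customer["tenure_days"] = (reference_date - per_customer["min"]).dt.days

print(purchases)
print(per_customer)
```

Fixing the reference date (rather than using the current time) keeps the features reproducible and avoids using information from after the prediction point.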
Implementation
Manual Implementation (pandas and NumPy Only)
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
# Create sample datetime data
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
np.random.seed(42)
data = {
'timestamp': dates,
'value': np.random.randn(len(dates)) * 10 + 100
}
df = pd.DataFrame(data)
# Add some specific timestamps for demonstration
df_ts = pd.DataFrame({
'timestamp': pd.to_datetime([
'2023-01-15 08:30:00',
'2023-06-20 14:45:00',
'2023-12-25 23:15:00',
'2024-03-10 06:00:00'
])
})
print("Sample Timestamps:")
print(df_ts)
# 1. BASIC EXTRACTION (Manual)
def extract_datetime_components_manual(dt_series):
    """
    Manual extraction of datetime components.
    """
    result = pd.DataFrame(index=dt_series.index)
    # Date components
    result['year'] = dt_series.dt.year
    result['month'] = dt_series.dt.month
    result['day'] = dt_series.dt.day
    result['dayofweek'] = dt_series.dt.dayofweek  # 0=Monday
    result['dayofyear'] = dt_series.dt.dayofyear
    result['weekofyear'] = dt_series.dt.isocalendar().week.values
    # Quarter
    result['quarter'] = dt_series.dt.quarter
    # Time components
    result['hour'] = dt_series.dt.hour
    result['minute'] = dt_series.dt.minute
    result['second'] = dt_series.dt.second
    return result
print("\n=== 1. DATETIME COMPONENT EXTRACTION ===")
components = extract_datetime_components_manual(df_ts['timestamp'])
print(components)
# 2. CYCLICAL ENCODING (Manual)
def cyclical_encode_manual(series, period):
    """
    Manual cyclical encoding using sine and cosine.
    Maps values to unit circle.
    """
    result = pd.DataFrame(index=series.index)
    # Convert to an angle in [0, 2π)
    radians = 2 * np.pi * series / period
    # Sine and cosine encoding
    result[f'{series.name}_sin'] = np.sin(radians)
    result[f'{series.name}_cos'] = np.cos(radians)
    return result
print("\n=== 2. CYCLICAL ENCODING ===")
# Hour encoding (24-hour cycle)
df_ts['hour'] = df_ts['timestamp'].dt.hour
hour_cyclical = cyclical_encode_manual(df_ts['hour'], period=24)
print("\nHour cyclical encoding:")
print(pd.concat([df_ts[['timestamp', 'hour']], hour_cyclical], axis=1))
# Month encoding (12-month cycle)
df_ts['month'] = df_ts['timestamp'].dt.month
month_cyclical = cyclical_encode_manual(df_ts['month'], period=12)
print("\nMonth cyclical encoding:")
print(pd.concat([df_ts[['timestamp', 'month']], month_cyclical], axis=1))
# Day of week encoding (7-day cycle)
df_ts['dayofweek'] = df_ts['timestamp'].dt.dayofweek
dow_cyclical = cyclical_encode_manual(df_ts['dayofweek'], period=7)
print("\nDay of week cyclical encoding:")
print(pd.concat([df_ts[['timestamp', 'dayofweek']], dow_cyclical], axis=1))
# 3. TIME SINCE / AGE CALCULATIONS
def calculate_time_features_manual(df, timestamp_col, reference_date=None):
    """
    Calculate time-based features.
    """
    result = df.copy()
    ts = result[timestamp_col]
    if reference_date is None:
        reference_date = pd.Timestamp.now()
    else:
        reference_date = pd.Timestamp(reference_date)
    # Days since/until reference (negative when the timestamp is after the reference)
    result['days_since'] = (reference_date - ts).dt.days
    result['months_since'] = result['days_since'] / 30.44  # average month length in days
    result['years_since'] = result['days_since'] / 365.25  # average year length in days
    # Is weekend
    result['is_weekend'] = ts.dt.dayofweek >= 5
    # Is month start/end
    result['is_month_start'] = ts.dt.is_month_start
    result['is_month_end'] = ts.dt.is_month_end
    # Is quarter start/end
    result['is_quarter_start'] = ts.dt.is_quarter_start
    result['is_quarter_end'] = ts.dt.is_quarter_end
    # Season (Northern Hemisphere)
    month = ts.dt.month
    seasons = {12: 0, 1: 0, 2: 0,   # Winter
               3: 1, 4: 1, 5: 1,    # Spring
               6: 2, 7: 2, 8: 2,    # Summer
               9: 3, 10: 3, 11: 3}  # Fall
    result['season'] = month.map(seasons)
    return result
print("\n=== 3. TIME FEATURES ===")
df_time = calculate_time_features_manual(df_ts, 'timestamp', reference_date='2024-01-01')
print(df_time[['timestamp', 'days_since', 'months_since', 'is_weekend', 'season']])
# 4. LAG FEATURES
def create_lag_features_manual(series, lags=(1, 7, 30)):
    """
    Create lag features for time series.
    The first `lag` rows of each lag column are NaN because no earlier value exists.
    """
    result = pd.DataFrame(index=series.index)
    for lag in lags:
        result[f'lag_{lag}'] = series.shift(lag)
    return result
print("\n=== 4. LAG FEATURES ===")
df['value_shifted'] = df['value'] + np.random.randn(len(df)) * 5
lag_features = create_lag_features_manual(df['value_shifted'], lags=[1, 7, 30])
print(pd.concat([df[['timestamp', 'value_shifted']], lag_features], axis=1).head(10))
# 5. ROLLING WINDOW FEATURES
def create_rolling_features_manual(series, windows=(7, 30, 90)):
    """
    Create rolling window statistics.
    Each window includes the current row; with min_periods=1 the first rows use partial windows.
    """
    result = pd.DataFrame(index=series.index)
    for window in windows:
        result[f'rolling_mean_{window}'] = series.rolling(window=window, min_periods=1).mean()
        result[f'rolling_std_{window}'] = series.rolling(window=window, min_periods=1).std()
        result[f'rolling_min_{window}'] = series.rolling(window=window, min_periods=1).min()
        result[f'rolling_max_{window}'] = series.rolling(window=window, min_periods=1).max()
    return result
print("\n=== 5. ROLLING WINDOW FEATURES ===")
rolling_features = create_rolling_features_manual(df['value'], windows=[7, 30])
print(pd.concat([df[['timestamp', 'value']], rolling_features], axis=1).head(10))
# 6. EXPANDING WINDOW FEATURES
def create_expanding_features_manual(series):
    """
    Create expanding window statistics (growing window from start).
    """
    result = pd.DataFrame(index=series.index)
    result['expanding_mean'] = series.expanding(min_periods=1).mean()
    result['expanding_std'] = series.expanding(min_periods=1).std()
    result['expanding_max'] = series.expanding(min_periods=1).max()
    result['expanding_sum'] = series.expanding(min_periods=1).sum()
    # Cumulative count
    result['cumulative_count'] = range(1, len(series) + 1)
    return result
print("\n=== 6. EXPANDING WINDOW FEATURES ===")
expanding_features = create_expanding_features_manual(df['value'].iloc[:10])
print(pd.concat([df[['timestamp', 'value']].iloc[:10], expanding_features], axis=1))
# 7. TIME DIFFERENCES (deltas)
def calculate_time_deltas_manual(df, timestamp_col, group_col=None):
    """
    Calculate time differences between consecutive events.
    `time_diff` is in hours and `days_diff` in days; the first event (per group) is NaN.
    """
    result = df.copy()
    if group_col:
        result['time_diff'] = result.groupby(group_col)[timestamp_col].diff().dt.total_seconds() / 3600
    else:
        result['time_diff'] = result[timestamp_col].diff().dt.total_seconds() / 3600
    result['days_diff'] = result['time_diff'] / 24
    return result
print("\n=== 7. TIME DIFFERENCES ===")
df_delta = calculate_time_deltas_manual(df.iloc[:10], 'timestamp')
print(df_delta[['timestamp', 'time_diff', 'days_diff']])
# 8. BUSINESS DAYS CALCULATION
def count_business_days_manual(start_date, end_date):
    """
    Count business days in the half-open range [start_date, end_date)
    (simplified, no holidays).
    """
    start = pd.Timestamp(start_date)
    end = pd.Timestamp(end_date)
    business_days = 0
    current = start
    while current < end:
        # Monday=0, Sunday=6
        if current.dayofweek < 5:
            business_days += 1
        current += timedelta(days=1)
    return business_days
print("\n=== 8. BUSINESS DAYS ===")
start = '2023-01-01'
end = '2023-01-31'
print(f"Business days between {start} and {end}: {count_business_days_manual(start, end)}")
# 9. ELAPSED TIME ENCODING (for time series)
def encode_elapsed_time_manual(series, freq='D'):
    """
    Encode time as elapsed units since start.
    freq: 'D' (days), 'H' (hours), 'M' (approximate months); anything else returns seconds.
    """
    min_time = series.min()
    elapsed = (series - min_time)
    if freq == 'D':
        return elapsed.dt.days
    elif freq == 'H':
        return elapsed.dt.total_seconds() / 3600
    elif freq == 'M':
        return elapsed.dt.days / 30.44
    else:
        return elapsed.dt.total_seconds()
print("\n=== 9. ELAPSED TIME ENCODING ===")
df['elapsed_days'] = encode_elapsed_time_manual(df['timestamp'], freq='D')
print(df[['timestamp', 'elapsed_days']].head())
# 10. SUMMARY OF CYCLICAL ENCODING IMPORTANCE
print("\n=== WHY CYCLICAL ENCODING MATTERS ===")
hours = pd.DataFrame({'hour': [23, 0, 1]})
hours['linear'] = hours['hour']
hours['sin'] = np.sin(2 * np.pi * hours['hour'] / 24)
hours['cos'] = np.cos(2 * np.pi * hours['hour'] / 24)
print("Hours 23, 0, 1:")
print(hours)
print("\nIn linear encoding, 23 and 0 are far apart (distance=23)")
print("In cyclical encoding, 23 and 0 are close (distance on unit circle is small)")
# Calculate the shortest distance around the cycle, in hours (not a Euclidean distance)
def circular_distance(h1, h2, period=24):
    """Calculate shortest distance on circle"""
    diff = abs(h1 - h2)
    return min(diff, period - diff)
print(f"\nCircular distance between hour 23 and 0: {circular_distance(23, 0)} hour(s)")
print(f"Circular distance between hour 23 and 1: {circular_distance(23, 1)} hour(s)")
print(f"Linear distance between hour 23 and 0: {abs(23 - 0)} hour(s)")
Using Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')
# Create sample data with datetime
np.random.seed(42)
dates = pd.date_range(start='2022-01-01', end='2023-12-31', freq='h')  # 'H' is deprecated in pandas 2.2+
dates = dates[:5000] # Limit for demo
data = {
'timestamp': dates,
'value': np.sin(np.arange(len(dates)) * 2 * np.pi / 24) * 10 + np.random.randn(len(dates)) * 2 + 50,
'category': np.random.choice(['A', 'B', 'C'], len(dates))
}
df = pd.DataFrame(data)
print("Dataset shape:", df.shape)
print("\nSample data:")
print(df.head())
# 1. PANDAS DATETIME PROPERTIES
print("\n" + "="*60)
print("1. PANDAS DATETIME EXTRACTION")
print("="*60)
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day'] = df['timestamp'].dt.day
df['hour'] = df['timestamp'].dt.hour
df['minute'] = df['timestamp'].dt.minute
df['dayofweek'] = df['timestamp'].dt.dayofweek # Monday=0
df['dayofyear'] = df['timestamp'].dt.dayofyear
df['weekofyear'] = df['timestamp'].dt.isocalendar().week.values
df['quarter'] = df['timestamp'].dt.quarter
df['is_month_start'] = df['timestamp'].dt.is_month_start
df['is_month_end'] = df['timestamp'].dt.is_month_end
df['is_quarter_start'] = df['timestamp'].dt.is_quarter_start
df['is_quarter_end'] = df['timestamp'].dt.is_quarter_end
df['is_year_start'] = df['timestamp'].dt.is_year_start
df['is_year_end'] = df['timestamp'].dt.is_year_end
df['is_leap_year'] = df['timestamp'].dt.is_leap_year
df['days_in_month'] = df['timestamp'].dt.days_in_month
print(f"Extracted {len([c for c in df.columns if c not in ['timestamp', 'value', 'category']])} datetime features")
print("\nSample extracted features:")
print(df[['timestamp', 'year', 'month', 'day', 'hour', 'dayofweek', 'quarter']].head())
# 2. CYCLICAL ENCODING WITH SKLEARN
print("\n" + "="*60)
print("2. CYCLICAL ENCODING PIPELINE")
print("="*60)
def cyclical_transformer(period):
    """Create cyclical encoding transformer for a given period"""
    def encode(X):
        X = X.astype(float)
        sin_feat = np.sin(2 * np.pi * X / period)
        cos_feat = np.cos(2 * np.pi * X / period)
        return np.column_stack([sin_feat, cos_feat])
    return FunctionTransformer(encode, validate=True)
# Apply cyclical encoding
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['dow_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
df['doy_sin'] = np.sin(2 * np.pi * df['dayofyear'] / 365)
df['doy_cos'] = np.cos(2 * np.pi * df['dayofyear'] / 365)
print("Cyclical encoding applied:")
print(df[['hour', 'hour_sin', 'hour_cos', 'dayofweek', 'dow_sin', 'dow_cos']].head(8))
# Verify cyclical property
print("\nCyclical property verification:")
print(f"Hour 23: sin={df[df['hour']==23]['hour_sin'].iloc[0]:.4f}, cos={df[df['hour']==23]['hour_cos'].iloc[0]:.4f}")
print(f"Hour 0: sin={df[df['hour']==0]['hour_sin'].iloc[0]:.4f}, cos={df[df['hour']==0]['hour_cos'].iloc[0]:.4f}")
print(f"Hour 1: sin={df[df['hour']==1]['hour_sin'].iloc[0]:.4f}, cos={df[df['hour']==1]['hour_cos'].iloc[0]:.4f}")
print("Note: Hour 23 is close to Hour 0 in cyclical space")
# 3. TIME-BASED FEATURES
print("\n" + "="*60)
print("3. TIME-BASED DERIVED FEATURES")
print("="*60)
# Weekend/Weekday
df['is_weekend'] = df['dayofweek'].isin([5, 6])
df['is_weekday'] = ~df['is_weekend']
# Business hours
df['is_business_hours'] = (df['hour'] >= 9) & (df['hour'] < 17) & df['is_weekday']
# Part of day
def get_part_of_day(hour):
    if 5 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 17:
        return 'afternoon'
    elif 17 <= hour < 21:
        return 'evening'
    else:
        return 'night'
df['part_of_day'] = df['hour'].apply(get_part_of_day)
# Season (Northern Hemisphere)
def get_season(month):
    if month in [12, 1, 2]:
        return 'winter'
    elif month in [3, 4, 5]:
        return 'spring'
    elif month in [6, 7, 8]:
        return 'summer'
    else:
        return 'fall'
df['season'] = df['month'].apply(get_season)
# Encode as ordinal for modeling
season_map = {'winter': 0, 'spring': 1, 'summer': 2, 'fall': 3}
df['season_ord'] = df['season'].map(season_map)
print("Time-based features:")
print(df[['timestamp', 'is_weekend', 'is_business_hours', 'part_of_day', 'season']].head())
# 4. LAG AND ROLLING FEATURES
print("\n" + "="*60)
print("4. LAG AND ROLLING FEATURES")
print("="*60)
# Lag features
for lag in [1, 24, 168]:  # 1 hour, 1 day, 1 week
    df[f'value_lag_{lag}'] = df['value'].shift(lag)
# Rolling statistics
for window in [6, 24, 168]:  # 6 hours, 1 day, 1 week
    df[f'value_roll_mean_{window}'] = df['value'].rolling(window=window, min_periods=1).mean()
    df[f'value_roll_std_{window}'] = df['value'].rolling(window=window, min_periods=1).std()
    df[f'value_roll_min_{window}'] = df['value'].rolling(window=window, min_periods=1).min()
    df[f'value_roll_max_{window}'] = df['value'].rolling(window=window, min_periods=1).max()
# Expanding statistics
df['value_expanding_mean'] = df['value'].expanding(min_periods=1).mean()
df['value_expanding_std'] = df['value'].expanding(min_periods=1).std()
# EWMA (Exponentially Weighted Moving Average)
df['value_ewma_24'] = df['value'].ewm(span=24, min_periods=1).mean()
print("Lag and rolling features created:")
lag_cols = [c for c in df.columns if 'lag_' in c or 'roll_' in c or 'expanding' in c or 'ewma' in c]
print(df[['timestamp', 'value'] + lag_cols[:6]].head(10))
# 5. TIME DIFFERENCES AND DURATIONS
print("\n" + "="*60)
print("5. TIME DIFFERENCES AND DURATIONS")
print("="*60)
# Time since start
df['elapsed_hours'] = (df['timestamp'] - df['timestamp'].min()).dt.total_seconds() / 3600
df['elapsed_days'] = df['elapsed_hours'] / 24
# Time since last event (per category)
df['time_since_last_cat'] = df.groupby('category')['timestamp'].diff().dt.total_seconds() / 3600
# Inter-event time
df['time_to_next'] = -df['timestamp'].diff(-1).dt.total_seconds() / 3600
print("Time difference features:")
print(df[['timestamp', 'category', 'elapsed_hours', 'time_since_last_cat']].head(10))
# 6. COMPLETE DATETIME PIPELINE
print("\n" + "="*60)
print("6. COMPLETE DATETIME PIPELINE")
print("="*60)
class DateTimeFeatureExtractor:
    """
    Datetime feature extractor with a scikit-learn-style fit/transform interface.
    Note: `lags` and `rolling_windows` are stored but not applied in transform().
    """
    def __init__(self, datetime_col, cyclical=True, lags=None, rolling_windows=None):
        self.datetime_col = datetime_col
        self.cyclical = cyclical
        self.lags = lags or []
        self.rolling_windows = rolling_windows or []
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        dt = pd.to_datetime(X[self.datetime_col])
        # Basic components
        X['year'] = dt.dt.year
        X['month'] = dt.dt.month
        X['day'] = dt.dt.day
        X['hour'] = dt.dt.hour
        X['dayofweek'] = dt.dt.dayofweek
        X['dayofyear'] = dt.dt.dayofyear
        X['quarter'] = dt.dt.quarter
        X['is_weekend'] = dt.dt.dayofweek >= 5
        if self.cyclical:
            # Cyclical encoding
            X['hour_sin'] = np.sin(2 * np.pi * X['hour'] / 24)
            X['hour_cos'] = np.cos(2 * np.pi * X['hour'] / 24)
            X['dow_sin'] = np.sin(2 * np.pi * X['dayofweek'] / 7)
            X['dow_cos'] = np.cos(2 * np.pi * X['dayofweek'] / 7)
            X['month_sin'] = np.sin(2 * np.pi * X['month'] / 12)
            X['month_cos'] = np.cos(2 * np.pi * X['month'] / 12)
        return X
# Apply the extractor to the raw columns only; df already holds these features
# from the earlier steps, so comparing against df would report no new columns
df_raw = df[['timestamp', 'value', 'category']]
extractor = DateTimeFeatureExtractor('timestamp', cyclical=True)
df_transformed = extractor.transform(df_raw)
print(f"Features after transformation: {len(df_transformed.columns)}")
print("New columns:", [c for c in df_transformed.columns if c not in df_raw.columns])
# 7. TRAIN-TEST SPLIT CONSIDERATIONS
print("\n" + "="*60)
print("7. TIME-BASED TRAIN-TEST SPLIT")
print("="*60)
# Time-based split (prevent data leakage)
split_date = df['timestamp'].quantile(0.8)
print(f"Split date: {split_date}")
train_mask = df['timestamp'] < split_date
test_mask = df['timestamp'] >= split_date
X_train = df[train_mask]
X_test = df[test_mask]
print(f"Train: {len(X_train)} samples ({X_train['timestamp'].min()} to {X_train['timestamp'].max()})")
print(f"Test: {len(X_test)} samples ({X_test['timestamp'].min()} to {X_test['timestamp'].max()})")
print("\nImportant: Always split by time for time series to prevent data leakage!")
# 8. FEATURE IMPORTANCE ANALYSIS
print("\n" + "="*60)
print("8. DATETIME FEATURE IMPORTANCE")
print("="*60)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Prepare features
feature_cols = ['year', 'month', 'day', 'hour', 'dayofweek', 'quarter',
'hour_sin', 'hour_cos', 'dow_sin', 'dow_cos',
'month_sin', 'month_cos', 'is_weekend', 'is_business_hours']
# Remove rows with NaN
df_clean = df[feature_cols + ['value']].dropna()
X = df_clean[feature_cols]
y = df_clean['value']
# Train model
rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X, y)
# Feature importance
importance = pd.DataFrame({
'feature': feature_cols,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("Feature importance for predicting value:")
print(importance.head(10).to_string(index=False))
# 9. COMPARISON: WITH VS WITHOUT CYCLICAL
print("\n" + "="*60)
print("9. CYCLICAL VS LINEAR ENCODING COMPARISON")
print("="*60)
# Fit on the earlier rows and score on the later, held-out rows (time-based split
# from step 7); scoring on the training rows would inflate a random forest's R²
is_train = train_mask.loc[X.index]
X_tr, X_te, y_tr, y_te = X[is_train], X[~is_train], y[is_train], y[~is_train]
# Linear features only
linear_cols = ['year', 'month', 'day', 'hour', 'dayofweek', 'quarter']
rf_linear = RandomForestRegressor(n_estimators=50, random_state=42)
rf_linear.fit(X_tr[linear_cols], y_tr)
score_linear = rf_linear.score(X_te[linear_cols], y_te)
# Cyclical features only
cyclical_cols = ['hour_sin', 'hour_cos', 'dow_sin', 'dow_cos', 'month_sin', 'month_cos']
rf_cyclical = RandomForestRegressor(n_estimators=50, random_state=42)
rf_cyclical.fit(X_tr[cyclical_cols], y_tr)
score_cyclical = rf_cyclical.score(X_te[cyclical_cols], y_te)
# Combined
rf_combined = RandomForestRegressor(n_estimators=50, random_state=42)
rf_combined.fit(X_tr, y_tr)
score_combined = rf_combined.score(X_te, y_te)
print(f"Test R² with linear only: {score_linear:.4f}")
print(f"Test R² with cyclical only: {score_cyclical:.4f}")
print(f"Test R² with combined: {score_combined:.4f}")
# 10. BEST PRACTICES
print("\n" + "="*60)
print("10. BEST PRACTICES FOR DATETIME FEATURES")
print("="*60)
best_practices = {
'Practice': [
'Use cyclical encoding for periodic features',
'Extract multiple granularities',
'Time-based train-test split',
'Create lag features',
'Use rolling windows',
'Include business logic',
'Handle time zones',
'Consider holidays',
'Log transform durations',
'Validate seasonality'
],
'Description': [
'Encode hour, dayofweek, month as sin/cos to preserve circular nature',
'Extract year, month, day, hour, dayofweek for different patterns',
'Never random split time series; use cutoff date to prevent leakage',
'Previous values are strong predictors for time series',
'Moving averages capture trends and smooth noise',
'Weekend flags, business hours match real-world behavior',
'Convert to common timezone; store original timezone info',
'Holiday indicators often explain anomalous patterns',
'Elapsed times are often skewed; log transform helps',
'Plot ACF/PACF to identify significant seasonal periods'
]
}
print(pd.DataFrame(best_practices).to_string(index=False))
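The walkthrough above defines a `FunctionTransformer` factory for cyclical encoding but then computes the sine/cosine columns by hand. A minimal sketch of wiring such a factory into a `ColumnTransformer` is shown here; the helper name `make_cyclical` and the small `demo` frame are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def make_cyclical(period):
    """Return a transformer mapping each input column to (sin, cos) columns."""
    def encode(X):
        # X arrives as a 2D selection of columns; convert to a float array
        angle = 2 * np.pi * np.asarray(X, dtype=float) / period
        return np.hstack([np.sin(angle), np.cos(angle)])
    return FunctionTransformer(encode)

# Hypothetical demo frame with already-extracted components
demo = pd.DataFrame({"hour": [0, 6, 12, 23], "dayofweek": [0, 2, 4, 6]})

ct = ColumnTransformer([
    ("hour", make_cyclical(24), ["hour"]),
    ("dow", make_cyclical(7), ["dayofweek"]),
])

# Output columns: hour_sin, hour_cos, dow_sin, dow_cos
encoded = ct.fit_transform(demo)
print(encoded.shape)
```

Keeping the encoding inside a transformer means the same step can be placed in a `Pipeline` ahead of a model, so training and inference apply identical transformations.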
When to Use
✅ Appropriate Use Cases:
- Cyclical encoding: Use for hour, day of week, month, day of year—any periodic feature
- Lag features: Use for time series forecasting, capturing temporal dependencies
- Rolling windows: Use for trend detection, smoothing noise, feature stability
- Elapsed time: Use for customer tenure, equipment age, time since event
- Business features: Use for models affected by business cycles (weekends, holidays, hours)
- Season encoding: Use when yearly patterns exist (retail, energy, agriculture)
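For the business-feature case, NumPy's `busday_count` offers a vectorized alternative to a day-by-day loop and accepts a holiday list. The sketch below assumes a made-up pair of holiday dates; like a simple loop over `[start, end)`, the count excludes the end date.

```python
import numpy as np

start, end = "2023-01-01", "2023-01-31"

# Hypothetical holiday list for illustration
holidays = ["2023-01-02", "2023-01-16"]

# Weekdays in the half-open range [start, end)
weekdays_only = np.busday_count(start, end)

# Same range, additionally skipping the listed holidays
with_holidays = np.busday_count(start, end, holidays=holidays)

print(weekdays_only, with_holidays)
```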
❌ Avoid When:
- Don't use raw timestamps as features—models can't interpret them meaningfully
- Avoid linear encoding for cyclical features—23:59 appears far from 00:00
- Don't create lag features without handling missing values at series start
- Avoid future leakage—never use future information to predict past
- Don't ignore timezone—mixing timezones creates meaningless features
- Avoid excessive granularity—minute-level features when hourly patterns suffice cause overfitting
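Two of the points above (missing values at the start of lagged series, and future leakage) can be illustrated together. In this sketch, which uses a small made-up series, the rolling mean is computed on `shift(1)` so that each window ends at the previous step and never contains the value being predicted; rows without enough history are then dropped.

```python
import pandas as pd

# Small made-up daily series
s = pd.Series(
    [10.0, 12.0, 11.0, 15.0, 14.0, 13.0],
    index=pd.date_range("2023-01-01", periods=6, freq="D"),
    name="value",
)

features = pd.DataFrame({
    "lag_1": s.shift(1),
    # shift(1) first, so the window covers t-3..t-1 and excludes the target at t
    "roll_mean_3_past": s.shift(1).rolling(window=3).mean(),
})

# Rows without full history contain NaN; drop them (or impute) before training
train = features.join(s).dropna()
print(train)
```

Dropping the incomplete rows is only one option; imputing or using `min_periods` trades a longer training set for noisier early features.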
Common Pitfalls
- Data leakage through random (non-time-based) splits: future data leaks into training
- Not handling missing timestamps—irregular time series need special treatment
- Ignoring daylight saving time—hour duplicates or skips cause issues
- Using datetime features without cyclical encoding—loses circular relationships
- Creating too many lag features—causes multicollinearity and overfitting
- Not validating stationarity—some models assume stable temporal patterns
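The daylight saving pitfall is easy to reproduce. The sketch below assumes timestamps stored in UTC and a target timezone of `America/New_York` (both chosen for illustration): two events one hour apart in UTC land on local hours 1 and 3, because local hour 2 does not exist on the spring-forward date.

```python
import pandas as pd

# Timestamps stored in UTC; the target timezone is an assumption for illustration
utc = pd.Series(pd.to_datetime(
    ["2023-03-12 06:30:00", "2023-03-12 07:30:00"], utc=True
))

# Convert to local time before extracting hour-of-day features
local = utc.dt.tz_convert("America/New_York")

hours_utc = utc.dt.hour.tolist()
hours_local = local.dt.hour.tolist()
print(hours_utc, hours_local)
```

Whether the UTC hour or the local hour is the better feature depends on what drives the behavior being modeled; human activity usually follows local clock time.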