Datetime Feature Engineering: Extraction, Encoding, and Cyclical Features

Intermediate Preprocessing


Prerequisites:

Feature Engineering: Polynomials, Interactions, Binning, and Domain Features

Definition

Datetime feature engineering transforms temporal data into numerical representations that machine learning models can process. Raw timestamps contain rich information—year, month, day, hour, seasonality, elapsed time—but require extraction and encoding to be useful. Effective datetime engineering captures temporal patterns, trends, and cyclical behaviors that are invisible in raw timestamps. Feature extraction decomposes timestamps into components (year, month, day-of-week, hour). Cyclical encoding represents periodic patterns (time of day, day of week, season) using sine/cosine transformations, preserving the circular nature of time (e.g., hour 23 is close to hour 0). Time since events captures duration and recency effects. Seasonality encoding represents repeating patterns at different granularities (daily, weekly, yearly). Lag features capture temporal dependencies. Rolling windows aggregate historical data. Proper datetime engineering is essential for time series forecasting, churn prediction, fraud detection, demand planning, and any problem where temporal patterns influence outcomes.

Intuition


Think of datetime features like describing a photo's context. The raw timestamp is like saying 'this was taken at 1703534400': meaningless to most people. Extracting components is like noting 'December 25, 2023, 8 PM' (UTC): suddenly we know it's Christmas evening. Cyclical encoding is like understanding that 11 PM is close to midnight, which is close to 1 AM. A linear encoding of the hour (23, 0, 1) puts 23 and 0 at opposite ends of the scale, while sine/cosine encoding keeps them adjacent. Time since last purchase captures customer engagement decay. Day-of-week features capture 'weekend shopping' patterns. Lag features ask 'what happened yesterday?', which is critical for forecasting. Just as a photo's context matters, the temporal context of data points often carries much of the predictive signal.
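
To make the "adjacent on the circle" intuition concrete, here is a minimal sketch that measures the Euclidean distance between hours after sine/cosine encoding. The helper names `encode_cyclical` and `encoded_distance` are illustrative and not part of any library.

```python
import numpy as np

def encode_cyclical(value, period):
    """Map a periodic value to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

def encoded_distance(a, b, period=24):
    """Euclidean distance between two values after sin/cos encoding."""
    return float(np.linalg.norm(encode_cyclical(a, period) - encode_cyclical(b, period)))

d_23_0 = encoded_distance(23, 0)   # one hour apart, across midnight
d_23_1 = encoded_distance(23, 1)   # two hours apart, across midnight
d_0_12 = encoded_distance(0, 12)   # opposite sides of the clock

print(f"raw |23 - 0| = {abs(23 - 0)}, encoded distance = {d_23_0:.3f}")
print(f"raw |23 - 1| = {abs(23 - 1)}, encoded distance = {d_23_1:.3f}")
print(f"raw |0 - 12| = {abs(0 - 12)}, encoded distance = {d_0_12:.3f}")
```

In the encoded space the largest possible distance is 2 (opposite points on the circle), so hours that are neighbours across midnight end up much closer to each other than hours half a day apart.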

Mathematical Formula

\text{Unix Timestamp:} \quad t = \text{seconds since 1970-01-01 00:00:00 UTC}

\text{Cyclical Encoding:} \quad x_{sin} = \sin\left(\frac{2\pi x}{P}\right), \quad x_{cos} = \cos\left(\frac{2\pi x}{P}\right), \quad P = \text{period of } x \text{ (24 for hour, 7 for day of week, 12 for month)}

\text{Time Since:} \quad \Delta t = t_{current} - t_{event}

\text{Age (years):} \quad \text{age} = \frac{t_{now} - t_{birth}}{365.25 \text{ days}}

\text{Week of Year:} \quad woy = \left\lfloor \frac{10 + doy - dow}{7} \right\rfloor

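\text{where } doy = \text{day of year } (1 \le doy \le 366), \quad dow = \text{ISO day of week } (1 = \text{Monday}, \ldots, 7 = \text{Sunday}); \quad woy = 0 \text{ belongs to the last week of the previous year, and } woy = 53 \text{ may belong to week 1 of the next year}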
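
As a sanity check on the week-of-year formula, the sketch below implements it with the standard library and compares the result with `date.isocalendar()`. It assumes the ISO convention for `dow` (Monday=1 through Sunday=7) and adds the two year-boundary corrections that ISO 8601 week numbering needs. The helper names are illustrative.

```python
from datetime import date, timedelta

def weeks_in_iso_year(year):
    """Return 52 or 53, the number of ISO weeks in the given year."""
    def p(y):
        return (y + y // 4 - y // 100 + y // 400) % 7
    return 53 if p(year) == 4 or p(year - 1) == 3 else 52

def iso_week(d):
    """ISO week number from day of year and ISO weekday (Monday=1 ... Sunday=7)."""
    doy = d.timetuple().tm_yday
    dow = d.isoweekday()
    week = (10 + doy - dow) // 7
    if week < 1:                           # date belongs to the last week of the previous year
        return weeks_in_iso_year(d.year - 1)
    if week > weeks_in_iso_year(d.year):   # date belongs to week 1 of the next year
        return 1
    return week

# Cross-check every day in a range of years against the standard library
d = date(2015, 1, 1)
mismatches = 0
while d <= date(2030, 12, 31):
    if iso_week(d) != d.isocalendar()[1]:
        mismatches += 1
    d += timedelta(days=1)

print("mismatches:", mismatches)
print("2023-01-01 ->", iso_week(date(2023, 1, 1)))
print("2024-12-30 ->", iso_week(date(2024, 12, 30)))
```

Note that pandas' `dt.dayofweek` uses Monday=0, so add 1 before plugging it into this formula.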

Step-by-Step Explanation:

Unix Timestamp: Seconds since epoch, useful for calculating durations and sorting
Cyclical Encoding: Map cyclical features (hour, day) to unit circle using sine and cosine
Time Since: Calculate elapsed time between events (recency, age, tenure)
Week of Year: ISO week numbering for seasonal analysis
Business Days: Count working days between dates, excluding weekends and holidays
Lag Features: Values from previous time periods for time series modeling
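
The business-days step mentions holidays, which a simple weekday loop does not handle. One possible sketch uses NumPy's `busday_count`, which counts valid days in the half-open interval from start up to (but not including) end. The holiday dates below are hypothetical and only for illustration.

```python
import numpy as np

start = np.datetime64('2023-01-01')
end = np.datetime64('2023-01-31')

# Hypothetical holiday calendar, used only for illustration
holidays = np.array(['2023-01-02', '2023-01-16'], dtype='datetime64[D]')

# Weekdays only, end date excluded
weekdays_only = int(np.busday_count(start, end))

# Weekdays minus the listed holidays, end date excluded
with_holidays = int(np.busday_count(start, end, holidays=holidays))

# Shift the end by one day to include the end date itself
inclusive_end = int(np.busday_count(start, end + np.timedelta64(1, 'D'), holidays=holidays))

print("Weekdays only:", weekdays_only)
print("Excluding holidays:", with_holidays)
print("Excluding holidays, end date included:", inclusive_end)
```

Whether the end date counts is a convention worth fixing explicitly, since an off-by-one here shifts every duration feature derived from it.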

Real-World Use Cases

Healthcare

Patient readmission prediction uses days since last visit (recency), season (flu season), day-of-week (weekend admissions riskier). Time since medication start for adherence analysis.

Finance

Credit scoring uses account age, days since last transaction (activity), month-end patterns (paycheck timing). Cyclical encoding for trading hours. Time between transactions for fraud detection.

Retail

Demand forecasting uses day-of-week (weekend spikes), month (seasonal), holidays. Customer recency (days since last purchase) for churn prediction. Time since first purchase (customer lifetime).

Manufacturing

Predictive maintenance uses equipment age, operating hours, time since last maintenance. Seasonal patterns (temperature effects). Shift encoding (day/night production differences).

Tech

User engagement uses session time-of-day (productivity hours), day-of-week (weekend usage), account age. Time since last login (churn risk). Cyclical patterns for daily active users.
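
Several of the use cases above rely on recency and tenure. A minimal sketch of how such features might be derived from an event log with pandas follows; the column names (`customer_id`, `purchase_time`), the data, and the snapshot date are all made up for illustration.

```python
import pandas as pd

# Hypothetical purchase log
purchases = pd.DataFrame({
    'customer_id': ['a', 'a', 'a', 'b', 'b'],
    'purchase_time': pd.to_datetime([
        '2023-01-05', '2023-01-20', '2023-03-01',
        '2023-02-10', '2023-03-25',
    ]),
}).sort_values(['customer_id', 'purchase_time']).reset_index(drop=True)

# Gap between consecutive purchases of the same customer (NaN for the first one)
purchases['days_since_prev_purchase'] = (
    purchases.groupby('customer_id')['purchase_time'].diff().dt.total_seconds() / 86400
)

# Per-customer recency and tenure as of a fixed snapshot date
snapshot = pd.Timestamp('2023-04-01')
summary = purchases.groupby('customer_id')['purchase_time'].agg(
    first_purchase='min', last_purchase='max'
)
summary['recency_days'] = (snapshot - summary['last_purchase']).dt.days
summary['tenure_days'] = (snapshot - summary['first_purchase']).dt.days

print(purchases)
print(summary)
```

Using a fixed snapshot date rather than the current time keeps the features reproducible and avoids counting events that occurred after the prediction point.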

Implementation

Manual Implementation (pandas and NumPy only)

import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# Create sample datetime data
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
np.random.seed(42)

data = {
    'timestamp': dates,
    'value': np.random.randn(len(dates)) * 10 + 100
}
df = pd.DataFrame(data)

# Add some specific timestamps for demonstration
df_ts = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2023-01-15 08:30:00',
        '2023-06-20 14:45:00',
        '2023-12-25 23:15:00',
        '2024-03-10 06:00:00'
    ])
})

print("Sample Timestamps:")
print(df_ts)

# 1. BASIC EXTRACTION (Manual)
def extract_datetime_components_manual(dt_series):
    """
    Manual extraction of datetime components.
    """
    result = pd.DataFrame(index=dt_series.index)
    
    # Date components
    result['year'] = dt_series.dt.year
    result['month'] = dt_series.dt.month
    result['day'] = dt_series.dt.day
    result['dayofweek'] = dt_series.dt.dayofweek  # 0=Monday
    result['dayofyear'] = dt_series.dt.dayofyear
    result['weekofyear'] = dt_series.dt.isocalendar().week.values
    
    # Quarter
    result['quarter'] = dt_series.dt.quarter
    
    # Time components
    result['hour'] = dt_series.dt.hour
    result['minute'] = dt_series.dt.minute
    result['second'] = dt_series.dt.second
    
    return result

print("\n=== 1. DATETIME COMPONENT EXTRACTION ===")
components = extract_datetime_components_manual(df_ts['timestamp'])
print(components)

# 2. CYCLICAL ENCODING (Manual)
def cyclical_encode_manual(series, period):
    """
    Manual cyclical encoding using sine and cosine.
    Maps values to unit circle.
    """
    result = pd.DataFrame(index=series.index)
    
    # Normalize to [0, 2π]
    radians = 2 * np.pi * series / period
    
    # Sine and cosine encoding
    result[f'{series.name}_sin'] = np.sin(radians)
    result[f'{series.name}_cos'] = np.cos(radians)
    
    return result

print("\n=== 2. CYCLICAL ENCODING ===")

# Hour encoding (24-hour cycle)
df_ts['hour'] = df_ts['timestamp'].dt.hour
hour_cyclical = cyclical_encode_manual(df_ts['hour'], period=24)
print("\nHour cyclical encoding:")
print(pd.concat([df_ts[['timestamp', 'hour']], hour_cyclical], axis=1))

# Month encoding (12-month cycle)
df_ts['month'] = df_ts['timestamp'].dt.month
month_cyclical = cyclical_encode_manual(df_ts['month'], period=12)
print("\nMonth cyclical encoding:")
print(pd.concat([df_ts[['timestamp', 'month']], month_cyclical], axis=1))

# Day of week encoding (7-day cycle)
df_ts['dayofweek'] = df_ts['timestamp'].dt.dayofweek
dow_cyclical = cyclical_encode_manual(df_ts['dayofweek'], period=7)
print("\nDay of week cyclical encoding:")
print(pd.concat([df_ts[['timestamp', 'dayofweek']], dow_cyclical], axis=1))

# 3. TIME SINCE / AGE CALCULATIONS
def calculate_time_features_manual(df, timestamp_col, reference_date=None):
    """
    Calculate time-based features.
    """
    result = df.copy()
    ts = result[timestamp_col]
    
    if reference_date is None:
        reference_date = pd.Timestamp.now()
    else:
        reference_date = pd.Timestamp(reference_date)
    
    # Days since/until reference
    result['days_since'] = (reference_date - ts).dt.days
    result['months_since'] = result['days_since'] / 30.44
    result['years_since'] = result['days_since'] / 365.25
    
    # Is weekend
    result['is_weekend'] = ts.dt.dayofweek >= 5
    
    # Is month start/end
    result['is_month_start'] = ts.dt.is_month_start
    result['is_month_end'] = ts.dt.is_month_end
    
    # Is quarter start/end
    result['is_quarter_start'] = ts.dt.is_quarter_start
    result['is_quarter_end'] = ts.dt.is_quarter_end
    
    # Season (Northern Hemisphere)
    month = ts.dt.month
    seasons = {12: 0, 1: 0, 2: 0,  # Winter
               3: 1, 4: 1, 5: 1,   # Spring
               6: 2, 7: 2, 8: 2,   # Summer
               9: 3, 10: 3, 11: 3} # Fall
    result['season'] = month.map(seasons)
    
    return result

print("\n=== 3. TIME FEATURES ===")
df_time = calculate_time_features_manual(df_ts, 'timestamp', reference_date='2024-01-01')
print(df_time[['timestamp', 'days_since', 'months_since', 'is_weekend', 'season']])

# 4. LAG FEATURES
def create_lag_features_manual(series, lags=(1, 7, 30)):
    """
    Create lag features for time series.
    """
    result = pd.DataFrame(index=series.index)
    
    for lag in lags:
        result[f'lag_{lag}'] = series.shift(lag)
    
    return result

print("\n=== 4. LAG FEATURES ===")
df['value_shifted'] = df['value'] + np.random.randn(len(df)) * 5
lag_features = create_lag_features_manual(df['value_shifted'], lags=[1, 7, 30])
print(pd.concat([df[['timestamp', 'value_shifted']], lag_features], axis=1).head(10))

# 5. ROLLING WINDOW FEATURES
def create_rolling_features_manual(series, windows=(7, 30, 90)):
    """
    Create rolling window statistics.
    """
    result = pd.DataFrame(index=series.index)
    
    for window in windows:
        result[f'rolling_mean_{window}'] = series.rolling(window=window, min_periods=1).mean()
        result[f'rolling_std_{window}'] = series.rolling(window=window, min_periods=1).std()
        result[f'rolling_min_{window}'] = series.rolling(window=window, min_periods=1).min()
        result[f'rolling_max_{window}'] = series.rolling(window=window, min_periods=1).max()
    
    return result

print("\n=== 5. ROLLING WINDOW FEATURES ===")
rolling_features = create_rolling_features_manual(df['value'], windows=[7, 30])
print(pd.concat([df[['timestamp', 'value']], rolling_features], axis=1).head(10))

# 6. EXPANDING WINDOW FEATURES
def create_expanding_features_manual(series):
    """
    Create expanding window statistics (growing window from start).
    """
    result = pd.DataFrame(index=series.index)
    
    result['expanding_mean'] = series.expanding(min_periods=1).mean()
    result['expanding_std'] = series.expanding(min_periods=1).std()
    result['expanding_max'] = series.expanding(min_periods=1).max()
    result['expanding_sum'] = series.expanding(min_periods=1).sum()
    
    # Cumulative count
    result['cumulative_count'] = range(1, len(series) + 1)
    
    return result

print("\n=== 6. EXPANDING WINDOW FEATURES ===")
expanding_features = create_expanding_features_manual(df['value'].iloc[:10])
print(pd.concat([df[['timestamp', 'value']].iloc[:10], expanding_features], axis=1))

# 7. TIME DIFFERENCES (deltas)
def calculate_time_deltas_manual(df, timestamp_col, group_col=None):
    """
    Calculate time differences between consecutive events.
    """
    result = df.copy()
    
    if group_col:
        result['time_diff'] = result.groupby(group_col)[timestamp_col].diff().dt.total_seconds() / 3600
    else:
        result['time_diff'] = result[timestamp_col].diff().dt.total_seconds() / 3600
    
    result['days_diff'] = result['time_diff'] / 24
    
    return result

print("\n=== 7. TIME DIFFERENCES ===")
df_delta = calculate_time_deltas_manual(df.iloc[:10], 'timestamp')
print(df_delta[['timestamp', 'time_diff', 'days_diff']])

# 8. BUSINESS DAYS CALCULATION
def count_business_days_manual(start_date, end_date):
    """
    Count business days (Mon-Fri) from start_date up to, but not including,
    end_date. Simplified: holidays are not excluded.
    """
    start = pd.Timestamp(start_date)
    end = pd.Timestamp(end_date)
    
    business_days = 0
    current = start
    
    while current < end:
        # Monday=0, Sunday=6
        if current.dayofweek < 5:
            business_days += 1
        current += timedelta(days=1)
    
    return business_days

print("\n=== 8. BUSINESS DAYS ===")
start = '2023-01-01'
end = '2023-01-31'
print(f"Business days between {start} and {end}: {count_business_days_manual(start, end)}")

# 9. ELAPSED TIME ENCODING (for time series)
def encode_elapsed_time_manual(series, freq='D'):
    """
    Encode time as elapsed units since start.
    """
    min_time = series.min()
    elapsed = (series - min_time)
    
    if freq == 'D':
        return elapsed.dt.days
    elif freq == 'H':
        return elapsed.dt.total_seconds() / 3600
    elif freq == 'M':
        return elapsed.dt.days / 30.44
    else:
        return elapsed.dt.total_seconds()

print("\n=== 9. ELAPSED TIME ENCODING ===")
df['elapsed_days'] = encode_elapsed_time_manual(df['timestamp'], freq='D')
print(df[['timestamp', 'elapsed_days']].head())

# 10. SUMMARY OF CYCLICAL ENCODING IMPORTANCE
print("\n=== WHY CYCLICAL ENCODING MATTERS ===")
hours = pd.DataFrame({'hour': [23, 0, 1]})
hours['linear'] = hours['hour']
hours['sin'] = np.sin(2 * np.pi * hours['hour'] / 24)
hours['cos'] = np.cos(2 * np.pi * hours['hour'] / 24)

print("Hours 23, 0, 1:")
print(hours)
print("\nIn linear encoding, 23 and 0 are far apart (distance=23)")
print("In cyclical encoding, 23 and 0 are close (distance on unit circle is small)")

# Calculate circular (wrap-around) distances in hours
def circular_distance(h1, h2, period=24):
    """Calculate shortest distance on circle"""
    diff = abs(h1 - h2)
    return min(diff, period - diff)

print(f"\nCircular distance between hour 23 and 0: {circular_distance(23, 0)} hour(s)")
print(f"Circular distance between hour 23 and 1: {circular_distance(23, 1)} hour(s)")
print(f"Linear distance between hour 23 and 0: {abs(23 - 0)} hour(s)")

Using Libraries (pandas, scikit-learn)

import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# Create sample data with datetime
np.random.seed(42)
dates = pd.date_range(start='2022-01-01', end='2023-12-31', freq='H')
dates = dates[:5000]  # Limit for demo

data = {
    'timestamp': dates,
    'value': np.sin(np.arange(len(dates)) * 2 * np.pi / 24) * 10 + np.random.randn(len(dates)) * 2 + 50,
    'category': np.random.choice(['A', 'B', 'C'], len(dates))
}
df = pd.DataFrame(data)

print("Dataset shape:", df.shape)
print("\nSample data:")
print(df.head())

# 1. PANDAS DATETIME PROPERTIES
print("\n" + "="*60)
print("1. PANDAS DATETIME EXTRACTION")
print("="*60)

df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day'] = df['timestamp'].dt.day
df['hour'] = df['timestamp'].dt.hour
df['minute'] = df['timestamp'].dt.minute
df['dayofweek'] = df['timestamp'].dt.dayofweek  # Monday=0
df['dayofyear'] = df['timestamp'].dt.dayofyear
df['weekofyear'] = df['timestamp'].dt.isocalendar().week.values
df['quarter'] = df['timestamp'].dt.quarter
df['is_month_start'] = df['timestamp'].dt.is_month_start
df['is_month_end'] = df['timestamp'].dt.is_month_end
df['is_quarter_start'] = df['timestamp'].dt.is_quarter_start
df['is_quarter_end'] = df['timestamp'].dt.is_quarter_end
df['is_year_start'] = df['timestamp'].dt.is_year_start
df['is_year_end'] = df['timestamp'].dt.is_year_end
df['is_leap_year'] = df['timestamp'].dt.is_leap_year
df['days_in_month'] = df['timestamp'].dt.days_in_month

print(f"Extracted {len([c for c in df.columns if c not in ['timestamp', 'value', 'category']])} datetime features")
print("\nSample extracted features:")
print(df[['timestamp', 'year', 'month', 'day', 'hour', 'dayofweek', 'quarter']].head())

# 2. CYCLICAL ENCODING WITH SKLEARN
print("\n" + "="*60)
print("2. CYCLICAL ENCODING PIPELINE")
print("="*60)

def cyclical_transformer(period):
    """Create cyclical encoding transformer for a given period"""
    def encode(X):
        X = X.astype(float)
        sin_feat = np.sin(2 * np.pi * X / period)
        cos_feat = np.cos(2 * np.pi * X / period)
        return np.column_stack([sin_feat, cos_feat])
    return FunctionTransformer(encode, validate=True)

# Apply cyclical encoding
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

df['dow_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)

df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

df['doy_sin'] = np.sin(2 * np.pi * df['dayofyear'] / 365)
df['doy_cos'] = np.cos(2 * np.pi * df['dayofyear'] / 365)

print("Cyclical encoding applied:")
print(df[['hour', 'hour_sin', 'hour_cos', 'dayofweek', 'dow_sin', 'dow_cos']].head(8))

# Verify cyclical property
print("\nCyclical property verification:")
print(f"Hour 23: sin={df[df['hour']==23]['hour_sin'].iloc[0]:.4f}, cos={df[df['hour']==23]['hour_cos'].iloc[0]:.4f}")
print(f"Hour 0:  sin={df[df['hour']==0]['hour_sin'].iloc[0]:.4f}, cos={df[df['hour']==0]['hour_cos'].iloc[0]:.4f}")
print(f"Hour 1:  sin={df[df['hour']==1]['hour_sin'].iloc[0]:.4f}, cos={df[df['hour']==1]['hour_cos'].iloc[0]:.4f}")
print("Note: Hour 23 is close to Hour 0 in cyclical space")

# 3. TIME-BASED FEATURES
print("\n" + "="*60)
print("3. TIME-BASED DERIVED FEATURES")
print("="*60)

# Weekend/Weekday
df['is_weekend'] = df['dayofweek'].isin([5, 6])
df['is_weekday'] = ~df['is_weekend']

# Business hours
df['is_business_hours'] = (df['hour'] >= 9) & (df['hour'] < 17) & df['is_weekday']

# Part of day
def get_part_of_day(hour):
    if 5 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 17:
        return 'afternoon'
    elif 17 <= hour < 21:
        return 'evening'
    else:
        return 'night'

df['part_of_day'] = df['hour'].apply(get_part_of_day)

# Season
def get_season(month):
    if month in [12, 1, 2]:
        return 'winter'
    elif month in [3, 4, 5]:
        return 'spring'
    elif month in [6, 7, 8]:
        return 'summer'
    else:
        return 'fall'

df['season'] = df['month'].apply(get_season)

# Encode as ordinal for modeling
season_map = {'winter': 0, 'spring': 1, 'summer': 2, 'fall': 3}
df['season_ord'] = df['season'].map(season_map)

print("Time-based features:")
print(df[['timestamp', 'is_weekend', 'is_business_hours', 'part_of_day', 'season']].head())

# 4. LAG AND ROLLING FEATURES
print("\n" + "="*60)
print("4. LAG AND ROLLING FEATURES")
print("="*60)

# Lag features
for lag in [1, 24, 168]:  # 1 hour, 1 day, 1 week
    df[f'value_lag_{lag}'] = df['value'].shift(lag)

# Rolling statistics
for window in [6, 24, 168]:  # 6 hours, 1 day, 1 week
    df[f'value_roll_mean_{window}'] = df['value'].rolling(window=window, min_periods=1).mean()
    df[f'value_roll_std_{window}'] = df['value'].rolling(window=window, min_periods=1).std()
    df[f'value_roll_min_{window}'] = df['value'].rolling(window=window, min_periods=1).min()
    df[f'value_roll_max_{window}'] = df['value'].rolling(window=window, min_periods=1).max()

# Expanding statistics
df['value_expanding_mean'] = df['value'].expanding(min_periods=1).mean()
df['value_expanding_std'] = df['value'].expanding(min_periods=1).std()

# EWMA (Exponentially Weighted Moving Average)
df['value_ewma_24'] = df['value'].ewm(span=24, min_periods=1).mean()

print("Lag and rolling features created:")
lag_cols = [c for c in df.columns if 'lag_' in c or 'roll_' in c or 'expanding' in c or 'ewma' in c]
print(df[['timestamp', 'value'] + lag_cols[:6]].head(10))

# 5. TIME DIFFERENCES AND DURATIONS
print("\n" + "="*60)
print("5. TIME DIFFERENCES AND DURATIONS")
print("="*60)

# Time since start
df['elapsed_hours'] = (df['timestamp'] - df['timestamp'].min()).dt.total_seconds() / 3600
df['elapsed_days'] = df['elapsed_hours'] / 24

# Time since last event (per category)
df['time_since_last_cat'] = df.groupby('category')['timestamp'].diff().dt.total_seconds() / 3600

# Inter-event time
df['time_to_next'] = -df['timestamp'].diff(-1).dt.total_seconds() / 3600

print("Time difference features:")
print(df[['timestamp', 'category', 'elapsed_hours', 'time_since_last_cat']].head(10))

# 6. COMPLETE DATETIME PIPELINE
print("\n" + "="*60)
print("6. COMPLETE DATETIME PIPELINE")
print("="*60)

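class DateTimeFeatureExtractor:
    """
    Minimal datetime feature extractor with a scikit-learn-style fit/transform interface.
    Note: `lags` and `rolling_windows` are stored but not applied in this version;
    only calendar components and optional cyclical encodings are produced.
    """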
    
    def __init__(self, datetime_col, cyclical=True, lags=None, rolling_windows=None):
        self.datetime_col = datetime_col
        self.cyclical = cyclical
        self.lags = lags or []
        self.rolling_windows = rolling_windows or []
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        dt = pd.to_datetime(X[self.datetime_col])
        
        # Basic components
        X['year'] = dt.dt.year
        X['month'] = dt.dt.month
        X['day'] = dt.dt.day
        X['hour'] = dt.dt.hour
        X['dayofweek'] = dt.dt.dayofweek
        X['dayofyear'] = dt.dt.dayofyear
        X['quarter'] = dt.dt.quarter
        X['is_weekend'] = dt.dt.dayofweek >= 5
        
        if self.cyclical:
            # Cyclical encoding
            X['hour_sin'] = np.sin(2 * np.pi * X['hour'] / 24)
            X['hour_cos'] = np.cos(2 * np.pi * X['hour'] / 24)
            X['dow_sin'] = np.sin(2 * np.pi * X['dayofweek'] / 7)
            X['dow_cos'] = np.cos(2 * np.pi * X['dayofweek'] / 7)
            X['month_sin'] = np.sin(2 * np.pi * X['month'] / 12)
            X['month_cos'] = np.cos(2 * np.pi * X['month'] / 12)
        
        return X

# Apply the extractor to the raw columns only, so the generated columns show up as new
df_raw = df[['timestamp', 'value', 'category']]
extractor = DateTimeFeatureExtractor('timestamp', cyclical=True)
df_transformed = extractor.transform(df_raw)

print(f"Features after transformation: {len(df_transformed.columns)}")
print("New columns:", [c for c in df_transformed.columns if c not in df_raw.columns])

# 7. TRAIN-TEST SPLIT CONSIDERATIONS
print("\n" + "="*60)
print("7. TIME-BASED TRAIN-TEST SPLIT")
print("="*60)

# Time-based split (prevent data leakage)
split_date = df['timestamp'].quantile(0.8)
print(f"Split date: {split_date}")

train_mask = df['timestamp'] < split_date
test_mask = df['timestamp'] >= split_date

X_train = df[train_mask]
X_test = df[test_mask]

print(f"Train: {len(X_train)} samples ({X_train['timestamp'].min()} to {X_train['timestamp'].max()})")
print(f"Test: {len(X_test)} samples ({X_test['timestamp'].min()} to {X_test['timestamp'].max()})")
print("\nImportant: Always split by time for time series to prevent data leakage!")

# 8. FEATURE IMPORTANCE ANALYSIS
print("\n" + "="*60)
print("8. DATETIME FEATURE IMPORTANCE")
print("="*60)

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Prepare features
feature_cols = ['year', 'month', 'day', 'hour', 'dayofweek', 'quarter',
                'hour_sin', 'hour_cos', 'dow_sin', 'dow_cos',
                'month_sin', 'month_cos', 'is_weekend', 'is_business_hours']

# Remove rows with NaN
df_clean = df[feature_cols + ['value']].dropna()

X = df_clean[feature_cols]
y = df_clean['value']

# Train model
rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X, y)

# Feature importance
importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature importance for predicting value:")
print(importance.head(10).to_string(index=False))

# 9. COMPARISON: WITH VS WITHOUT CYCLICAL
print("\n" + "="*60)
print("9. CYCLICAL VS LINEAR ENCODING COMPARISON")
print("="*60)

# Score on a later, held-out time period; R² on the training rows would mostly reward memorization
cutoff = df['timestamp'].quantile(0.8)
is_train = df.loc[X.index, 'timestamp'] < cutoff
X_tr, X_te = X[is_train], X[~is_train]
y_tr, y_te = y[is_train], y[~is_train]

# Linear features only
linear_cols = ['year', 'month', 'day', 'hour', 'dayofweek', 'quarter']
rf_linear = RandomForestRegressor(n_estimators=50, random_state=42)
rf_linear.fit(X_tr[linear_cols], y_tr)
score_linear = rf_linear.score(X_te[linear_cols], y_te)

# Cyclical features only
cyclical_cols = ['hour_sin', 'hour_cos', 'dow_sin', 'dow_cos', 'month_sin', 'month_cos']
rf_cyclical = RandomForestRegressor(n_estimators=50, random_state=42)
rf_cyclical.fit(X_tr[cyclical_cols], y_tr)
score_cyclical = rf_cyclical.score(X_te[cyclical_cols], y_te)

# Combined
rf_combined = RandomForestRegressor(n_estimators=50, random_state=42)
rf_combined.fit(X_tr, y_tr)
score_combined = rf_combined.score(X_te, y_te)

print(f"Test R² with linear only: {score_linear:.4f}")
print(f"Test R² with cyclical only: {score_cyclical:.4f}")
print(f"Test R² with combined: {score_combined:.4f}")

# 10. BEST PRACTICES
print("\n" + "="*60)
print("10. BEST PRACTICES FOR DATETIME FEATURES")
print("="*60)

best_practices = {
    'Practice': [
        'Use cyclical encoding for periodic features',
        'Extract multiple granularities',
        'Time-based train-test split',
        'Create lag features',
        'Use rolling windows',
        'Include business logic',
        'Handle time zones',
        'Consider holidays',
        'Log transform durations',
        'Validate seasonality'
    ],
    'Description': [
        'Encode hour, dayofweek, month as sin/cos to preserve circular nature',
        'Extract year, month, day, hour, dayofweek for different patterns',
        'Never random split time series; use cutoff date to prevent leakage',
        'Previous values are strong predictors for time series',
        'Moving averages capture trends and smooth noise',
        'Weekend flags, business hours match real-world behavior',
        'Convert to common timezone; store original timezone info',
        'Holiday indicators often explain anomalous patterns',
        'Elapsed times are often skewed; log transform helps',
        'Plot ACF/PACF to identify significant seasonal periods'
    ]
}

print(pd.DataFrame(best_practices).to_string(index=False))

When to Use

✅ Appropriate Use Cases:

Cyclical encoding: Use for hour, day of week, month, day of year—any periodic feature
Lag features: Use for time series forecasting, capturing temporal dependencies
Rolling windows: Use for trend detection, smoothing noise, feature stability
Elapsed time: Use for customer tenure, equipment age, time since event
Business features: Use for models affected by business cycles (weekends, holidays, hours)
Season encoding: Use when yearly patterns exist (retail, energy, agriculture)
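
The business features mentioned above can be sketched as simple boolean flags. The example below assumes a hypothetical holiday list and a 9:00 to 17:00 working day; a real project would substitute a calendar for its own region and business.

```python
import pandas as pd

# Hypothetical holiday calendar; a real project would load one for its region
holidays = pd.to_datetime(['2023-01-01', '2023-12-25'])

events = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2023-12-22 10:00',  # Friday
        '2023-12-23 15:00',  # Saturday
        '2023-12-25 09:00',  # Monday, listed as a holiday
        '2023-12-26 18:30',  # Tuesday, after hours
    ])
})

# Compare calendar days, not full timestamps, against the holiday list
day = events['timestamp'].dt.normalize()

events['is_weekend'] = events['timestamp'].dt.dayofweek >= 5
events['is_holiday'] = day.isin(holidays)
events['is_business_day'] = ~events['is_weekend'] & ~events['is_holiday']

# Hours 9 through 16 inclusive, i.e. 09:00 up to 17:00
events['is_business_hours'] = (
    events['is_business_day'] & events['timestamp'].dt.hour.between(9, 16)
)

print(events)
```

Normalizing to midnight before the `isin` check matters: a timestamp such as 09:00 on a holiday would otherwise never match a date-only holiday entry.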

❌ Avoid When:

Don't use raw timestamps as features—models can't interpret them meaningfully
Avoid linear encoding for cyclical features—23:59 appears far from 00:00
Don't create lag features without handling missing values at series start
Avoid future leakage—never use future information to predict past
Don't ignore timezone—mixing timezones creates meaningless features
Avoid excessive granularity—minute-level features when hourly patterns suffice cause overfitting
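
Two of the points above, missing values at the start of a lagged series and future leakage, can be illustrated together. In the sketch below (toy data, illustrative column names), the series is shifted by one step before the rolling window is applied, so each feature only sees values strictly before the row it describes; rows without a full history are then dropped.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2023-01-01', periods=10, freq='D')
y = pd.Series(np.arange(10, dtype=float), index=idx, name='y')

# A rolling mean on the raw series includes the current value of y
naive = y.rolling(window=3).mean()

# Shifting first means the window covers only the three previous days
features = pd.DataFrame({
    'lag_1': y.shift(1),
    'roll_mean_3_past': y.shift(1).rolling(window=3).mean(),
})

# The first rows have incomplete history; drop them (or impute) before training
features_clean = features.dropna()

print(pd.concat([y, naive.rename('naive_roll_mean_3'), features], axis=1))
print("Rows kept:", len(features_clean))
```

Dropping the incomplete rows is the simplest option; imputation is an alternative when the series is short, at the cost of injecting artificial values at the start.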

Common Pitfalls

Data leakage through random train-test splits: future data leaks into training unless the split is made by time
Not handling missing timestamps—irregular time series need special treatment
Ignoring daylight saving time—hour duplicates or skips cause issues
Using datetime features without cyclical encoding—loses circular relationships
Creating too many lag features—causes multicollinearity and overfitting
Not validating stationarity—some models assume stable temporal patterns
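
The daylight saving pitfall is easiest to see on concrete timestamps. The sketch below assumes events are stored in UTC and that local behaviour follows the `America/New_York` zone (chosen only as an example); it converts to local time before extracting the hour and keeps the UTC offset as an extra feature.

```python
import pandas as pd

# Events stored in UTC, around the 2023 US spring-forward and fall-back transitions
utc_events = pd.Series(pd.to_datetime([
    '2023-03-12 06:30:00',
    '2023-03-12 07:30:00',
    '2023-11-05 05:30:00',
    '2023-11-05 06:30:00',
])).dt.tz_localize('UTC')

# Convert to the local zone before extracting hour-of-day features
local = utc_events.dt.tz_convert('America/New_York')

features = pd.DataFrame({
    'utc_hour': utc_events.dt.hour,
    'local_hour': local.dt.hour,
    # The offset distinguishes the two local 01:30 readings in November
    'utc_offset_hours': local.apply(lambda t: t.utcoffset().total_seconds() / 3600),
})

print(features)
```

In March the local hour jumps from 1 to 3, and in November two events an hour apart share the same local hour, which is why storing UTC and deriving local features explicitly tends to be safer than working with naive local timestamps.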
