Encoding Categorical Variables: Label, One-Hot, Target, and Ordinal Encoding
Definition
Categorical variable encoding is the process of converting qualitative data (categories, labels, or discrete values) into numerical representations that machine learning algorithms can process. Most ML models require numeric inputs, making encoding an essential preprocessing step. The choice of encoding method depends on the nature of the categorical variable (nominal vs ordinal), the cardinality (number of unique categories), the relationship between categories and target variable, and the specific algorithm being used. Nominal variables have no intrinsic order (colors, cities, product types) and require encoding that doesn't impose artificial ordering. Ordinal variables have a natural ranking (education level, satisfaction ratings, size categories) where preserving order is important. High-cardinality categorical variables pose special challenges due to the curse of dimensionality with one-hot encoding. Modern techniques like target encoding leverage the relationship between categories and the target variable to create informative numerical representations while reducing dimensionality.
Intuition
Think of categorical encoding like translating between languages. Label encoding assigns each word a unique number—'red'=1, 'blue'=2, 'green'=3—but this implies green is greater than blue, which is nonsense for colors. One-hot encoding is like having separate yes/no checkboxes for each color—each gets its own dedicated space with no false ordering. Target encoding is smarter: instead of arbitrary numbers, use the average outcome for that category—if red cars sell better than blue cars, red gets a higher score. Ordinal encoding preserves ranking like medal positions—gold, silver, bronze clearly have an order. The right method depends on whether your categories have relationships, how many categories exist, and whether your model can handle the resulting representation.
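The "false ordering" point can be made concrete with a small sketch. It uses the illustrative mapping from the paragraph above (red=1, blue=2, green=3); the helper names `label_distance` and `one_hot_distance` are introduced here for illustration only.

```python
import numpy as np

colors = ["red", "blue", "green"]

# Label encoding: arbitrary integers, as in the analogy above
label = {"red": 1, "blue": 2, "green": 3}

# One-hot encoding: one indicator position per color
identity = np.eye(len(colors), dtype=int)
one_hot = {c: identity[i] for i, c in enumerate(colors)}

def label_distance(a, b):
    """Absolute difference between the integer codes of two categories."""
    return abs(label[a] - label[b])

def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two categories."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

for a, b in [("red", "blue"), ("red", "green"), ("blue", "green")]:
    print(f"{a}-{b}: label={label_distance(a, b)}, one-hot={one_hot_distance(a, b):.3f}")
```

Under the integer codes, red and green look twice as far apart as red and blue, although nothing about the colors justifies that. Under one-hot encoding every pair is equally distant.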
Mathematical Formula
Step-by-Step Explanation:
- Label Encoding: Assign each unique category an integer from 0 to k-1 (k = number of categories), ordered alphabetically or by frequency
- One-Hot Encoding: Create a binary vector of length k with a 1 at the category's position and 0 everywhere else
- Target Encoding: Replace each category with the mean target value for that category, smoothed toward the global mean
- Ordinal Encoding: Map ordered categories to integers that preserve the natural ranking (e.g., small=0, medium=1, large=2)
- Binary Encoding: Convert the category index to binary and create one column per bit (about log2(k) columns instead of k for one-hot)
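The steps above can be written compactly as follows. The notation is introduced here and is not from the original text: k is the number of unique categories, c_j is the category with index j (counting from 0), n_c and \bar{y}_c are the row count and mean target for category c, \bar{y} is the global target mean, and \alpha is the smoothing weight. The target-encoding line matches the smoothing formula used in the manual implementation in this article; ordinal encoding has the same form as label encoding, with j taken from the natural rank rather than an arbitrary order.

```latex
\begin{align*}
\text{Label / ordinal:}\quad & \ell(c_j) = j, \qquad j \in \{0, 1, \dots, k-1\} \\
\text{One-hot:}\quad & e(c_j) \in \{0,1\}^k, \qquad
  e(c_j)_i = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases} \\
\text{Target (smoothed):}\quad & \hat{y}_c = \frac{n_c\,\bar{y}_c + \alpha\,\bar{y}}{n_c + \alpha} \\
\text{Binary:}\quad & b_i(c_j) = \left\lfloor \frac{j}{2^i} \right\rfloor \bmod 2,
  \qquad i = 0, \dots, \lceil \log_2 k \rceil - 1
\end{align*}
```

As \alpha grows, or as n_c shrinks, the target-encoded value moves toward the global mean, which is what protects rare categories from extreme values.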
Real-World Use Cases
Patient blood types (A, B, AB, O) encoded with one-hot for classification models. Severity levels (mild, moderate, severe) use ordinal encoding. Hospital departments (100+ categories) use target encoding based on readmission rates.
Credit card transaction categories (food, travel, electronics) encoded with target encoding using fraud probability. Credit ratings (AAA to D) use ordinal encoding preserving the risk hierarchy.
Product categories with thousands of SKUs use target encoding based on purchase probability. Season categories (spring, summer, fall, winter) use cyclical encoding capturing the circular nature of time.
Machine IDs (high cardinality) encoded with target encoding using failure rate. Quality grades (A, B, C, D) use ordinal encoding where order indicates quality level.
User device types (iOS, Android, Web, Desktop) use one-hot encoding. Operating system versions use ordinal encoding if version order matters, or target encoding based on churn rate.
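Cyclical encoding, mentioned above for seasons, is not implemented elsewhere in this article. A minimal sketch follows, assuming the seasons are evenly spaced around a circle in the order spring, summer, fall, winter; the function name `cyclical_encode` and the column suffixes are illustrative choices, not a library API.

```python
import numpy as np
import pandas as pd

SEASONS = ["spring", "summer", "fall", "winter"]  # assumed cyclic order

def cyclical_encode(series, ordered_levels):
    """Map each level to a point on the unit circle (sin, cos of its angle)."""
    position = {level: i for i, level in enumerate(ordered_levels)}
    angle = 2 * np.pi * series.map(position) / len(ordered_levels)
    return pd.DataFrame(
        {
            f"{series.name}_sin": np.sin(angle),
            f"{series.name}_cos": np.cos(angle),
        },
        index=series.index,
    )

seasons = pd.Series(["spring", "summer", "fall", "winter", "spring"], name="season")
encoded = cyclical_encode(seasons, SEASONS)
print(encoded.round(3))
```

With this representation winter is as close to spring as spring is to summer, which a plain ordinal code (0, 1, 2, 3) would not capture.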
Implementation
Manual Implementation (No Libraries)
import numpy as np
import pandas as pd
from collections import defaultdict
# Create sample categorical data
data = {
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green', 'red', 'blue'],
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large', 'large', 'small'],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA', 'Chicago', 'NYC', 'LA'],
    'target': [1, 0, 1, 1, 0, 0, 1, 0]
}
df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
print("\nUnique values:")
for col in ['color', 'size', 'city']:
    print(f"  {col}: {df[col].unique()}")
# 1. LABEL ENCODING (Manual)
def label_encode_manual(series):
    """
    Manual label encoding - assign integers 0 to n-1 to categories.
    """
    # Get unique categories (sorted for reproducibility)
    categories = sorted(series.unique())
    # Create mapping
    mapping = {cat: idx for idx, cat in enumerate(categories)}
    # Apply mapping
    encoded = series.map(mapping)
    return encoded, mapping

print("\n=== 1. LABEL ENCODING (Manual) ===")
df_label = df.copy()
for col in ['color', 'city']:
    encoded, mapping = label_encode_manual(df[col])
    df_label[f'{col}_encoded'] = encoded
    print(f"\n{col} mapping: {mapping}")
print("\nLabel encoded:")
print(df_label[['color', 'color_encoded', 'city', 'city_encoded']])
# 2. ONE-HOT ENCODING (Manual)
def one_hot_encode_manual(series, prefix=None):
    """
    Manual one-hot encoding - create binary columns for each category.
    """
    categories = sorted(series.unique())
    prefix = prefix or series.name
    # Create result DataFrame
    result = pd.DataFrame(index=series.index)
    for cat in categories:
        col_name = f"{prefix}_{cat}"
        result[col_name] = (series == cat).astype(int)
    return result, categories

print("\n=== 2. ONE-HOT ENCODING (Manual) ===")
color_onehot, color_cats = one_hot_encode_manual(df['color'])
print("\nColor one-hot encoded:")
print(color_onehot)

# Full one-hot for multiple columns
def one_hot_encode_full(df, columns):
    result = df.copy()
    for col in columns:
        onehot, cats = one_hot_encode_manual(df[col], prefix=col)
        result = pd.concat([result, onehot], axis=1)
    return result

df_onehot = one_hot_encode_full(df, ['color', 'city'])
print("\nFull dataset with one-hot encoding:")
print(df_onehot)
# 3. ORDINAL ENCODING (Manual)
def ordinal_encode_manual(series, order):
    """
    Manual ordinal encoding - map categories to integers preserving order.
    """
    mapping = {cat: idx for idx, cat in enumerate(order)}
    encoded = series.map(mapping)
    return encoded, mapping

print("\n=== 3. ORDINAL ENCODING (Manual) ===")
size_order = ['small', 'medium', 'large']
size_encoded, size_mapping = ordinal_encode_manual(df['size'], size_order)
df_ordinal = df.copy()
df_ordinal['size_encoded'] = size_encoded
print(f"Size order: {size_order}")
print(f"Mapping: {size_mapping}")
print("\nOrdinal encoded size:")
print(df_ordinal[['size', 'size_encoded']])
# 4. TARGET ENCODING (Manual)
def target_encode_manual(df, cat_col, target_col, smoothing=1.0):
    """
    Manual target encoding with smoothing.
    Formula: (count * mean + alpha * global_mean) / (count + alpha)
    """
    global_mean = df[target_col].mean()
    # Calculate statistics per category
    stats = df.groupby(cat_col)[target_col].agg(['mean', 'count'])
    # Apply smoothing
    smoothed_means = (stats['count'] * stats['mean'] + smoothing * global_mean) / (stats['count'] + smoothing)
    # Create mapping
    mapping = smoothed_means.to_dict()
    # Apply to series
    encoded = df[cat_col].map(mapping)
    return encoded, mapping, global_mean

print("\n=== 4. TARGET ENCODING (Manual) ===")
color_target_encoded, color_mapping, global_mean = target_encode_manual(df, 'color', 'target', smoothing=1.0)
df_target = df.copy()
df_target['color_target_encoded'] = color_target_encoded
print(f"Global mean: {global_mean:.4f}")
print("\nColor target encoding mapping:")
for cat, val in color_mapping.items():
    count = (df['color'] == cat).sum()
    raw_mean = df[df['color'] == cat]['target'].mean()
    print(f"  {cat}: raw_mean={raw_mean:.2f}, smoothed={val:.4f} (n={count})")
print("\nTarget encoded color:")
print(df_target[['color', 'target', 'color_target_encoded']])
# 5. BINARY ENCODING (Manual)
def binary_encode_manual(series):
    """
    Manual binary encoding - convert category index to binary digits.
    """
    categories = sorted(series.unique())
    n_categories = len(categories)
    # ceil(log2(k)) bits; keep at least one column when there is a single category
    n_bits = max(1, int(np.ceil(np.log2(n_categories))))
    # Create mapping from category to index
    cat_to_idx = {cat: idx for idx, cat in enumerate(categories)}
    # Create result DataFrame
    result = pd.DataFrame(index=series.index)
    for i in range(n_bits):
        col_name = f"{series.name}_bin_{i}"
        # Column i holds bit i (least significant first) of the category index
        result[col_name] = series.map(lambda x: (cat_to_idx[x] >> i) & 1)
    return result, cat_to_idx, n_bits

print("\n=== 5. BINARY ENCODING (Manual) ===")
city_binary, city_idx_map, n_bits = binary_encode_manual(df['city'])
print(f"Categories: {list(city_idx_map.keys())}")
print(f"Index mapping: {city_idx_map}")
print(f"Number of bits: {n_bits}")
print("\nCity binary encoded:")
print(city_binary)
# 6. FREQUENCY/COUNT ENCODING (Manual)
def frequency_encode_manual(series):
    """
    Manual frequency encoding - replace category with its occurrence count.
    """
    freq = series.value_counts()
    encoded = series.map(freq)
    return encoded, freq.to_dict()

print("\n=== 6. FREQUENCY ENCODING (Manual) ===")
color_freq_encoded, freq_mapping = frequency_encode_manual(df['color'])
df_freq = df.copy()
df_freq['color_freq_encoded'] = color_freq_encoded
print(f"Frequency mapping: {freq_mapping}")
print("\nFrequency encoded color:")
print(df_freq[['color', 'color_freq_encoded']])
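# 7. HASH ENCODING (Manual - Simplified)
import hashlib

def hash_encode_manual(series, n_components=4):
    """
    Simplified hash encoding (the hashing trick): each category is hashed
    into one of n_components buckets and the bucket is one-hot encoded.
    MD5 is used instead of Python's built-in hash(), which is salted per
    process for strings and therefore not reproducible across runs.
    In practice, use murmurhash or similar for better distribution.
    """
    def bucket(value):
        digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
        return int(digest, 16) % n_components

    buckets = series.map(bucket)
    result = pd.DataFrame(index=series.index)
    for i in range(n_components):
        col_name = f"{series.name}_hash_{i}"
        # 1 if the category hashed into bucket i; distinct categories may collide
        result[col_name] = (buckets == i).astype(int)
    return result

print("\n=== 7. HASH ENCODING (Manual - Simplified) ===")
city_hash = hash_encode_manual(df['city'], n_components=4)
print("Hash encoded city (simplified):")
print(city_hash)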
# Compare encoding dimensions
print("\n=== ENCODING DIMENSION COMPARISON ===")
n_categories = df['city'].nunique()
print(f"City categories: {n_categories}")
print(f"Label encoding: 1 column")
print(f"One-hot encoding: {n_categories} columns")
print(f"Binary encoding: {int(np.ceil(np.log2(n_categories)))} columns")
print(f"Hash encoding (configurable): typically 8-16 columns")
Using Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder
from category_encoders import TargetEncoder, BinaryEncoder, HashingEncoder
from category_encoders import LeaveOneOutEncoder, CatBoostEncoder, WOEEncoder
import warnings
warnings.filterwarnings('ignore')
# Create comprehensive sample data
np.random.seed(42)
data = {
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green', 'red', 'blue'] * 10,
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large', 'large', 'small'] * 10,
    'city': ['NYC', 'LA', 'Chicago', 'Houston', 'NYC', 'LA', 'Chicago', 'Houston'] * 10,
    'department': ['IT', 'HR', 'Sales', 'IT', 'Marketing', 'HR', 'Sales', 'IT'] * 10,
    'target': np.random.binomial(1, 0.4, 80)
}
df = pd.DataFrame(data)
# Add high cardinality column
df['product_id'] = ['PROD_' + str(i % 20) for i in range(len(df))]
print("Dataset shape:", df.shape)
print("\nCategorical columns summary:")
for col in ['color', 'size', 'city', 'department', 'product_id']:
    print(f"  {col}: {df[col].nunique()} unique values")
# 1. SKLEARN LABEL ENCODER
print("\n" + "="*60)
print("1. SKLEARN LABEL ENCODER")
print("="*60)
df_label = df.copy()
label_encoders = {}
for col in ['color', 'city']:
    le = LabelEncoder()
    df_label[f'{col}_label'] = le.fit_transform(df[col])
    label_encoders[col] = le
    print(f"\n{col} classes: {le.classes_}")
    print(f"Encoded: {df_label[f'{col}_label'].unique()}")
# 2. SKLEARN ORDINAL ENCODER
print("\n" + "="*60)
print("2. SKLEARN ORDINAL ENCODER")
print("="*60)
# Define explicit ordering
categories = [['small', 'medium', 'large']]
ore = OrdinalEncoder(categories=categories)
df_ordinal = df.copy()
df_ordinal['size_ordinal'] = ore.fit_transform(df[['size']])
print(f"Size categories order: {ore.categories_}")
print("Encoded values:")
print(df_ordinal[['size', 'size_ordinal']].drop_duplicates().sort_values('size_ordinal'))
# 3. SKLEARN ONE-HOT ENCODER
print("\n" + "="*60)
print("3. SKLEARN ONE-HOT ENCODER")
print("="*60)
# Single column (the sparse_output argument requires scikit-learn >= 1.2)
ohe = OneHotEncoder(sparse_output=False, drop=None)
color_onehot = ohe.fit_transform(df[['color']])
color_onehot_df = pd.DataFrame(
    color_onehot,
    columns=[f'color_{cat}' for cat in ohe.categories_[0]]
)
print(f"One-hot shape: {color_onehot_df.shape}")
print("First 5 rows:")
print(color_onehot_df.head())
# Multiple columns: get feature names including all categories
ohe_multi = OneHotEncoder(sparse_output=False)
encoded_multi = ohe_multi.fit_transform(df[['color', 'city']])
feature_names = ohe_multi.get_feature_names_out(['color', 'city'])
print(f"\nMulti-column one-hot shape: {encoded_multi.shape}")
print(f"Feature names: {list(feature_names)}")
# 4. PANDAS GET_DUMMIES
print("\n" + "="*60)
print("4. PANDAS GET_DUMMIES")
print("="*60)
df_dummies = pd.get_dummies(df, columns=['color', 'city'], prefix=['col', 'city'])
dummy_cols = [c for c in df_dummies.columns if c.startswith(('col_', 'city_'))]
print(f"Dataframe shape after get_dummies: {df_dummies.shape}")
print(f"New columns: {dummy_cols}")
print("\nFirst 3 rows of dummy columns:")
print(df_dummies[dummy_cols].head(3))
# 5. TARGET ENCODING (category_encoders)
print("\n" + "="*60)
print("5. TARGET ENCODING (category_encoders)")
print("="*60)
# Basic target encoder
te = TargetEncoder(smoothing=1.0)
df_target = df.copy()
df_target['color_target'] = te.fit_transform(df['color'], df['target'])
print("Color target encoding mapping:")
for color in df['color'].unique():
    mask = df['color'] == color
    raw_mean = df[mask]['target'].mean()
    encoded_val = df_target[mask]['color_target'].iloc[0]
    print(f"  {color}: raw_mean={raw_mean:.3f}, encoded={encoded_val:.3f}")
# Multiple columns
te_multi = TargetEncoder(cols=['color', 'city', 'department'], smoothing=1.0)
df_target_multi = df.copy()
df_target_multi[['color_te', 'city_te', 'dept_te']] = te_multi.fit_transform(
    df[['color', 'city', 'department']], df['target']
)
print("\nTarget encoding for multiple columns applied")
# 6. BINARY ENCODING (category_encoders)
print("\n" + "="*60)
print("6. BINARY ENCODING (category_encoders)")
print("="*60)
be = BinaryEncoder(cols=['city'])
df_binary = be.fit_transform(df[['city']])
print(f"City: {df['city'].nunique()} categories → {df_binary.shape[1]} binary columns")
print("Binary encoded:")
print(df_binary.head())
# Compare with one-hot
print(f"\nDimensionality comparison for city ({df['city'].nunique()} categories):")
print(f"  One-hot: {df['city'].nunique()} columns")
print(f"  Binary: {df_binary.shape[1]} columns")
# 7. HASH ENCODING (category_encoders)
print("\n" + "="*60)
print("7. HASH ENCODING (category_encoders)")
print("="*60)
# Hash encoding for high cardinality
he = HashingEncoder(cols=['product_id'], n_components=8)
df_hash = he.fit_transform(df[['product_id']])
print(f"Product_id: {df['product_id'].nunique()} categories → {df_hash.shape[1]} hash columns")
print("Hash encoded (first 5 rows):")
print(df_hash.head())
# 8. ADVANCED ENCODERS
print("\n" + "="*60)
print("8. ADVANCED ENCODERS")
print("="*60)
# Leave-One-Out Encoder (prevents target leakage)
looe = LeaveOneOutEncoder(cols=['color'], sigma=0.05)
df_loo = df.copy()
df_loo['color_loo'] = looe.fit_transform(df['color'], df['target'])
print("\nLeave-One-Out encoding (prevents overfitting):")
print(df_loo[['color', 'target', 'color_loo']].head(10))
# CatBoost Encoder (ordered target encoding)
cbe = CatBoostEncoder(cols=['color'], sigma=0.05)
df_cb = df.copy()
df_cb['color_cb'] = cbe.fit_transform(df['color'], df['target'])
print("\nCatBoost encoding (ordered, reduces leakage):")
print(df_cb[['color', 'target', 'color_cb']].head(10))
# Weight of Evidence Encoder (popular in credit scoring)
try:
    woe = WOEEncoder(cols=['city'])
    df_woe = df.copy()
    df_woe['city_woe'] = woe.fit_transform(df['city'], df['target'])
    print("\nWeight of Evidence encoding (credit scoring):")
    print(df_woe[['city', 'target', 'city_woe']].head(10))
except Exception as e:
    print(f"\nWOE encoding skipped: {e}")
# 9. COMPARISON AND RECOMMENDATIONS
print("\n" + "="*60)
print("9. ENCODING STRATEGY COMPARISON")
print("="*60)
comparison_data = {
    'Method': ['Label', 'One-Hot', 'Ordinal', 'Target', 'Binary', 'Hash', 'LOO', 'CatBoost'],
    'Preserves Order': ['No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'No'],
    'Dimensions': ['1', 'k', '1', '1', 'log2(k)', 'fixed', '1', '1'],
    'Target Leakage Risk': ['No', 'No', 'No', 'Yes', 'No', 'No', 'Low', 'Low'],
    'Best For': [
        'Tree models, high cardinality',
        'Linear models, <10 categories',
        'Ordinal data',
        'High cardinality, large data',
        'High cardinality, tree models',
        'Very high cardinality',
        'Small datasets, CV',
        'Gradient boosting'
    ]
}
comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.to_string(index=False))
# 10. PRACTICAL EXAMPLE: Complete Pipeline
print("\n" + "="*60)
print("10. COMPLETE ENCODING PIPELINE")
print("="*60)

def encode_categorical_pipeline(df, target_col=None):
    """
    Comprehensive encoding pipeline based on column characteristics.
    """
    df_result = df.copy()
    for col in df.select_dtypes(include=['object']).columns:
        if col == target_col:
            continue
        n_unique = df[col].nunique()
        is_ordinal = col in ['size']  # Define your ordinal columns
        if is_ordinal:
            # Ordinal encoding
            print(f"{col}: Ordinal encoding")
            if col == 'size':
                categories = [['small', 'medium', 'large']]
            oe = OrdinalEncoder(categories=categories)
            df_result[f'{col}_ord'] = oe.fit_transform(df[[col]])
        elif n_unique <= 5:
            # One-hot for low cardinality
            print(f"{col}: One-hot encoding ({n_unique} categories)")
            dummies = pd.get_dummies(df[col], prefix=col, drop_first=False)
            df_result = pd.concat([df_result, dummies], axis=1)
        elif target_col and n_unique <= 100:
            # Target encoding for medium cardinality
            print(f"{col}: Target encoding ({n_unique} categories)")
            te = TargetEncoder(smoothing=1.0)
            df_result[f'{col}_te'] = te.fit_transform(df[col], df[target_col])
        else:
            # Binary encoding for high cardinality
            print(f"{col}: Binary encoding ({n_unique} categories)")
            be = BinaryEncoder(cols=[col])
            encoded = be.fit_transform(df[[col]])
            for c in encoded.columns:
                df_result[f'{col}_{c}'] = encoded[c]
    return df_result

df_encoded = encode_categorical_pipeline(df, target_col='target')
print(f"\nFinal dataset shape: {df_encoded.shape}")
print(f"Original columns: {len(df.columns)}")
print(f"Final columns: {len(df_encoded.columns)}")
# Save encoders for production
print("\n" + "="*60)
print("SAVING ENCODERS FOR PRODUCTION")
print("="*60)
import pickle
# Create and save encoders
encoders = {
    'label': LabelEncoder(),
    'onehot': OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
    'target': TargetEncoder(smoothing=1.0)
}
encoders['label'].fit(df['color'])
encoders['onehot'].fit(df[['color']])
encoders['target'].fit(df['color'], df['target'])
# Save to disk (example)
# with open('encoders.pkl', 'wb') as f:
#     pickle.dump(encoders, f)
print("Encoders trained and ready for serialization")
print("\nIMPORTANT: Always fit encoders on training data only!")
print("Use the same encoders to transform test/production data.")
When to Use
✅ Appropriate Use Cases:
- Label Encoding: Tree-based models (Random Forest, XGBoost, LightGBM) where order doesn't matter
- One-Hot Encoding: Linear models, neural networks, <10 categories, no natural ordering
- Ordinal Encoding: When categories have clear ranking (education levels, ratings, sizes)
- Target Encoding: High cardinality (>10 categories), large datasets, tree/gradient boosting models
- Binary Encoding: High cardinality, memory constraints, tree-based models
- Hash Encoding: Very high cardinality (>100 categories), streaming data, fixed dimensions needed
❌ Avoid When:
- One-hot encoding with high cardinality (>50 categories)—causes curse of dimensionality
- Label encoding with linear models—creates false ordinality that hurts performance
- Target encoding on small datasets (<1000 samples)—high risk of overfitting
- Target encoding without regularization—leaks target information into features
- Ordinal encoding on nominal data—imposes artificial ordering that confuses models
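One common way to regularize target encoding is to compute it out of fold, so that no row is encoded using its own target value. The sketch below is one possible implementation, not a library API: the function name `target_encode_oof`, its parameters, and the synthetic `machine_id` data are all assumptions made for illustration. It reuses the smoothing formula from the manual implementation in this article.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, cat_col, target_col, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold target encoding: each row is encoded with statistics
    computed from the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        prior = fit_part[target_col].mean()
        stats = fit_part.groupby(cat_col)[target_col].agg(["mean", "count"])
        # Same smoothing as the manual version: shrink toward the prior
        smoothed = (stats["count"] * stats["mean"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories absent from the fitting folds fall back to the prior
        values = df.iloc[enc_idx][cat_col].map(smoothed).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

# Synthetic demo data (random, for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "machine_id": rng.choice([f"M{i}" for i in range(30)], size=300),
    "failed": rng.integers(0, 2, size=300),
})
demo["machine_id_te"] = target_encode_oof(demo, "machine_id", "failed")
print(demo.head())
```

For test or production data, the encoding would instead be fitted once on the full training set and then applied unchanged.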
Common Pitfalls
- Fitting encoders on full dataset before train/test split—leaks information and overestimates performance
- One-hot encoding features with many categories—creates sparse matrices and overfitting
- Not handling unseen categories in production—causes errors when new values appear
- Using label encoding for linear models—models assume ordinality that doesn't exist
- Target encoding without smoothing—rare categories get extreme values based on few samples
- Forgetting to drop one category in one-hot for linear regression—creates multicollinearity