Categorical Data Analysis: Analyzing Discrete Variables

Definition

Categorical data analysis encompasses methods for exploring, summarizing, and testing relationships involving discrete variables. Unlike continuous data, categorical data consists of distinct groups or categories. These variables can be nominal (no inherent order) or ordinal (natural ordering). Key techniques include frequency tables, cross-tabulation, chi-square tests, bar charts, and mosaic plots.
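
The nominal/ordinal distinction can be made explicit in code. The following is a minimal sketch using pandas; the survey values and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical survey answers; the values are invented for illustration.
colors = pd.Series(["red", "blue", "red", "green", "blue"])
sizes = pd.Series(["medium", "small", "large", "small", "medium"])

# Nominal: categories without any order.
colors_cat = colors.astype("category")

# Ordinal: declare the order explicitly so sorting and min/max respect it.
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes_cat = sizes.astype(size_type)

print(colors_cat.cat.ordered)            # False
print(sizes_cat.cat.ordered)             # True
print(sizes_cat.min(), sizes_cat.max())  # small large
print(sizes_cat.sort_values().tolist())
```

Declaring the order up front keeps "small < medium < large" from silently becoming alphabetical order in sorts and plots.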

Intuition

Think of categorical analysis as organizing a party. Your guests fall into different groups (categories). Counting how many guests are in each group gives you a frequency table. Asking whether certain groups have different preferences is the question a chi-square test answers. Bar charts work like seating charts: you can see at a glance which groups are largest.

Mathematical Formula

Proportion:
\[ \quad \text{Proportion}(c_i) = \frac{\text{Frequency}(c_i)}{n} \]
Expected Frequency:
\[ \quad E_{ij} = \frac{(\text{Row } i \text{ total}) \times (\text{Column } j \text{ total})}{\text{Grand total}} \]
Chi-Square:
\[ \quad \chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

Step-by-Step Explanation:

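  1. Proportion: The share of the n observations that fall in category c_i, a value from 0 to 1. The proportions of all categories sum to 1.
  2. Expected frequency: The count we would expect in cell (i, j) if the row and column variables were independent.
  3. Chi-square: The sum, over all cells, of the squared difference between the observed count O_ij and the expected count E_ij, divided by E_ij. Larger values indicate a stronger departure from independence. The statistic is compared against a chi-square distribution with (rows - 1) × (columns - 1) degrees of freedom.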
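
As a worked example of these three steps, take a 2×2 table with observed counts 30 and 20 in the first row and 25 and 35 in the second. The row totals are 50 and 60, the column totals are 55 and 55, and the grand total is 110, so the first row accounts for a proportion of 50/110 ≈ 0.455 of the observations.

\[ \quad E_{11} = E_{12} = \frac{50 \times 55}{110} = 25, \qquad E_{21} = E_{22} = \frac{60 \times 55}{110} = 30 \]
\[ \quad \chi^2 = \frac{(30 - 25)^2}{25} + \frac{(20 - 25)^2}{25} + \frac{(25 - 30)^2}{30} + \frac{(35 - 30)^2}{30} = 1 + 1 + 0.833 + 0.833 \approx 3.67 \]

With (2 - 1) × (2 - 1) = 1 degree of freedom, the 5% critical value is about 3.84, so a statistic of 3.67 falls just short of significance at that level.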

Interactive Demo

[Interactive demo: grouped bar chart by category, using example data]

Real-World Use Cases

Market Research

Analyzing survey responses about brand preference by age group. Cross-tabulation shows which demographics prefer which brands.
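
A cross-tabulation like this might be sketched with pandas as follows. The survey rows, brand names, and age groups are hypothetical and exist only to illustrate the idea.

```python
import pandas as pd

# Hypothetical survey rows; brands and age groups are invented for illustration.
survey = pd.DataFrame({
    "age_group": ["18-29", "18-29", "18-29", "30-49", "30-49", "30-49", "50+", "50+"],
    "brand":     ["X", "X", "Y", "Y", "Y", "X", "Y", "Y"],
})

# Raw counts, with row and column totals added under the label "All".
counts = pd.crosstab(survey["age_group"], survey["brand"], margins=True)

# Row proportions: the share of each age group that prefers each brand.
row_share = pd.crosstab(survey["age_group"], survey["brand"], normalize="index")

print(counts)
print(row_share.round(2))
```

Row proportions are usually easier to compare than raw counts when the age groups differ in size.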

Clinical Trials

Testing if treatment outcomes differ between drug and placebo groups using 2x2 contingency tables.
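
A 2×2 test of this kind could look like the sketch below, which uses SciPy's `chi2_contingency`. The counts are hypothetical and do not come from any real trial. Note that for 2×2 tables SciPy applies Yates' continuity correction by default, so the default result differs from the plain chi-square formula.

```python
from scipy.stats import chi2_contingency

# Hypothetical trial counts (not from any real study):
# rows = drug, placebo; columns = improved, not improved.
table = [[30, 20],
         [25, 35]]

# correction=False gives the plain Pearson chi-square statistic.
stat_plain, p_plain, dof, expected = chi2_contingency(table, correction=False)

# The default (correction=True) applies Yates' continuity correction,
# which SciPy uses only when the table has one degree of freedom (2x2).
stat_yates, p_yates, _, _ = chi2_contingency(table)

print(f"Pearson: chi2={stat_plain:.2f}, p={p_plain:.4f}, dof={dof}")
print(f"Yates:   chi2={stat_yates:.2f}, p={p_yates:.4f}")
```

The corrected statistic is always the smaller of the two, which makes the test more conservative for small samples.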

E-commerce

Analyzing purchase categories by customer segment. Cross-tabs reveal which segments buy which products.

Implementation

Manual Implementation (No Libraries)

Frequency tables count occurrences. Chi-square compares observed to expected counts under independence.
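from collections import Counter

def frequency_table(data):
    """Return {category: {'freq': count, 'prop': share of total}}."""
    counts = Counter(data)
    total = len(data)
    if total == 0:
        return {}  # no observations, so there are no proportions to compute
    return {cat: {'freq': count, 'prop': count / total} for cat, count in counts.items()}

def chi_square_2x2(observed):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    a, b = observed[0]
    c, d = observed[1]
    n = a + b + c + d
    # Expected count = (row total) * (column total) / grand total
    expected_a = (a + b) * (a + c) / n
    expected_b = (a + b) * (b + d) / n
    expected_c = (c + d) * (a + c) / n
    expected_d = (c + d) * (b + d) / n
    # Sum of (O - E)^2 / E over the four cells.
    # Assumes every row and column total is nonzero.
    chi2 = ((a - expected_a)**2 / expected_a + (b - expected_b)**2 / expected_b +
            (c - expected_c)**2 / expected_c + (d - expected_d)**2 / expected_d)
    return chi2

fruits = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
print(frequency_table(fruits))
observed = [[30, 20], [25, 35]]
print(f'Chi-square: {chi_square_2x2(observed):.2f}')  # prints "Chi-square: 3.67"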

Using Libraries (pandas, numpy, scipy, matplotlib)

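import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated data: both columns are drawn independently at random,
# so no real association between category and outcome is expected.
np.random.seed(42)
df = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C'], 100),
    'outcome': np.random.choice(['Success', 'Failure'], 100)
})

# Frequency table for a single variable
print(df['category'].value_counts())

# Cross-tabulation (contingency table) of the two variables
crosstab = pd.crosstab(df['category'], df['outcome'])
print(crosstab)

# Chi-square test of independence; also returns the degrees of freedom
# and the table of expected counts under independence
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
print(f'Chi-square: {chi2:.4f}, p-value: {p:.4f}, dof: {dof}')

# Grouped bar chart: one group of bars per category, one bar per outcome
crosstab.plot(kind='bar')
plt.title('Categorical Comparison')
plt.ylabel('Count')
plt.show()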

When to Use

✅ Appropriate Use Cases:

  • Frequency tables: Summarizing a single categorical variable
  • Cross-tabulation: Examining the relationship between two categorical variables
  • Chi-square test: Testing whether two categorical variables are independent
  • Bar charts: Comparing frequencies across categories

❌ Avoid When:

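  • Do not rely on the chi-square test when expected frequencies are small (a common rule of thumb is below 5 in any cell); Fisher's exact test is a common alternative
  • Do not treat ordinal data as continuous without checking that equal spacing between levels is a reasonable assumption
  • Do not use pie charts for precise comparison; differences between similar slices are hard to judge, so prefer bar charts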
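
The expected-frequency check can be automated. The sketch below is one possible approach and is not part of the original implementation: it computes the smallest expected count and, when that count is below 5, uses Fisher's exact test (a standard alternative for small 2×2 tables) written in plain Python. The function names and the sample table are hypothetical.

```python
from math import comb

def min_expected_count(table):
    """Smallest expected cell count under independence for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return min(r * c / n for r in row_totals for c in col_totals)

def fisher_exact_2x2(table):
    """Two-sided Fisher's exact p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Add up every table that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-7))

# Hypothetical small sample, chosen so the expected counts fall below 5.
small_table = [[3, 1], [1, 3]]
if min_expected_count(small_table) < 5:
    p_value = fisher_exact_2x2(small_table)
    print(f"Expected counts too small for chi-square; Fisher p = {p_value:.3f}")
```

For this table every expected count is 2, so the chi-square approximation would not be trustworthy, and the exact test gives p ≈ 0.486.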

Common Pitfalls

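  • Low expected frequencies: the chi-square approximation becomes unreliable when expected cell counts are small
  • Simpson's paradox: an association seen in pooled data can weaken or reverse once the data are split by a third variable
  • Confounding variables: a third variable related to both categorical variables can create or hide an association between them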
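
Simpson's paradox is easiest to see with numbers. The counts below are hypothetical, invented so that treatment A does better within each severity group while treatment B looks better once the groups are pooled.

```python
# Hypothetical (successes, patients) counts, invented to show the reversal.
results = {
    "mild":   {"A": (90, 100),   "B": (800, 1000)},
    "severe": {"A": (300, 1000), "B": (20, 100)},
}

def success_rate(successes, total):
    return successes / total

# Within each severity group, treatment A has the higher success rate.
for severity, arms in results.items():
    rates = {arm: success_rate(*counts) for arm, counts in arms.items()}
    print(severity, {arm: round(r, 2) for arm, r in rates.items()})

# Pooling the groups ignores severity (the confounder) and flips the ranking.
pooled = {}
for arm in ("A", "B"):
    successes = sum(results[g][arm][0] for g in results)
    total = sum(results[g][arm][1] for g in results)
    pooled[arm] = success_rate(successes, total)
print("pooled", {arm: round(r, 2) for arm, r in pooled.items()})
```

The reversal happens because treatment A is given mostly to severe cases, which have low success rates under either treatment. Severity is the confounding variable here, and the cross-tab needs to be split by it before the treatments are compared.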