API Data Ingestion: REST APIs and JSON Parsing

Definition

API (Application Programming Interface) data ingestion is the process of programmatically extracting data from external services and systems through HTTP requests to RESTful endpoints. REST (Representational State Transfer) APIs use standard HTTP methods to perform CRUD operations on resources identified by URLs: POST creates, GET reads, PUT updates, and DELETE deletes. Data is typically exchanged in JSON (JavaScript Object Notation) format, though XML and other formats are also used. The workflow involves constructing HTTP requests with appropriate headers (including authentication tokens, API keys, or OAuth credentials), handling query parameters for filtering and pagination, managing rate limits and request throttling, processing HTTP response codes (200 OK, 404 Not Found, 429 Too Many Requests, 500 Internal Server Error), and parsing JSON responses into Python dictionaries or pandas DataFrames. Advanced patterns include implementing retry logic with exponential backoff, handling pagination through offset/limit or cursor-based approaches, batching multiple requests, and managing API authentication securely. Understanding HTTP status codes, content negotiation, and REST API design principles is essential for building robust data pipelines that integrate external data sources.
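As a rough illustration of the "processing HTTP response codes" step, the sketch below maps a status code to a coarse pipeline action. The helper name classify_status and the three action labels are assumptions made for this example, not part of any library or standard; only http.HTTPStatus comes from Python's standard library.

```python
from http import HTTPStatus

# Hypothetical helper: the action names "parse", "retry", and "fail" are illustrative.
def classify_status(status_code):
    """Map an HTTP status code to a coarse pipeline action."""
    if 200 <= status_code < 300:
        return "parse"   # success: safe to parse the body
    if status_code == HTTPStatus.TOO_MANY_REQUESTS or status_code >= 500:
        return "retry"   # rate limited or server-side fault: retry with backoff
    return "fail"        # any other code: fix the request rather than retry

for code in (200, 404, 429, 500):
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```

A real pipeline would likely refine this, for example by treating 401 as a signal to refresh credentials.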

Intuition

Think of an API like a restaurant with a standardized menu and ordering system. The restaurant (server) offers specific dishes (data resources) you can order, listed on its menu (API documentation). You (the client) place an order by speaking to the waiter (making an HTTP request) using specific phrases from the menu. The waiter takes your order to the kitchen (server processing) and returns with your meal (JSON response). Different order types work like HTTP methods: asking what is available is like GET, placing a new order is like POST, changing an existing order is like PUT, and canceling is like DELETE. Authentication is like showing your membership card - some restaurants (APIs) are open to all, others require credentials. Rate limiting is like the restaurant saying 'maximum 5 orders per minute' - you need to pace your requests. Pagination is like a large order that does not fit on one tray - the waiter brings one tray at a time, and you ask for 'the next tray' until you've gotten everything. JSON is like a standardized recipe format that describes each dish's ingredients in a structured way that any restaurant can understand.

Mathematical Formula

Backoff time: Wait = Base * 2^Attempt + Jitter

Step-by-Step Explanation:

Rate limiting caps how many requests you can make within a time window; exceeding the limit returns a 429 Too Many Requests error.
Pagination requires knowing how many pages to request: divide the total record count by the page size and round up so the final partial page is included (for example, 1,050 records at 100 per page need 11 pages).
Exponential backoff doubles the wait before each retry (Base * 2^Attempt) so a struggling server is not overwhelmed, and adds random jitter so that many clients do not retry in lockstep.
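The three points above can be sketched as small functions. The function names, the default values, and the cap on the wait time are assumptions added for illustration; the original formula has no cap.

```python
import math
import random

def backoff_wait(attempt, base=1.0, max_jitter=0.5, cap=60.0):
    """Wait = Base * 2^Attempt + Jitter, limited to `cap` seconds (the cap is an added assumption)."""
    jitter = random.uniform(0, max_jitter)
    return min(cap, base * (2 ** attempt) + jitter)

def pages_needed(total_records, page_size):
    """Total pages, rounding up to include a final partial page."""
    return math.ceil(total_records / page_size)

def min_interval(max_requests, window_seconds):
    """Seconds to leave between requests to stay within a rate limit."""
    return window_seconds / max_requests

print(pages_needed(1050, 100))  # 11
print(min_interval(5, 60))      # 12.0
print([round(backoff_wait(n), 2) for n in range(4)])  # roughly 1, 2, 4, 8 plus jitter
```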

Real-World Use Cases

Finance

Trading platforms ingest real-time stock prices and market data from APIs like Alpha Vantage or Yahoo Finance. They handle authentication with API keys, process paginated historical data, and parse JSON responses to update their databases and trading algorithms.

Retail

E-commerce companies use APIs to sync inventory with suppliers, fetch product details from manufacturers, and integrate with shipping carriers for tracking information. They implement retry logic and batch processing to handle large catalogs efficiently.

Tech

Social media analytics platforms ingest posts, comments, and engagement metrics from platforms like Twitter/X, Reddit, or Instagram APIs. They handle rate limiting, OAuth authentication, and pagination to build comprehensive datasets for sentiment analysis.

Implementation

Manual Implementation (No Libraries)

The manual implementation shows how HTTP requests and JSON parsing work under the hood. manual_get_request uses urllib to build the request at a low level, including headers and an SSL context. simple_json_parser is a thin wrapper around Python's built-in json module, which does the actual parsing of nested structures. flatten_json converts nested JSON objects (common in API responses) into flat dictionaries with dotted keys, suitable for DataFrames. Together these reveal the work that libraries like requests abstract away.

import json
import urllib.request
import urllib.parse
import ssl

def manual_get_request(url, headers=None):
    """Make a simple HTTP GET request using built-in libraries"""
    request = urllib.request.Request(url)
    if headers:
        for key, value in headers.items():
            request.add_header(key, value)
    
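    # Default SSL context: verifies the server's certificate and hostname
    context = ssl.create_default_context()
    
    # Note: urlopen raises urllib.error.HTTPError for 4xx/5xx responses
    with urllib.request.urlopen(request, context=context, timeout=30) as response:
        data = response.read().decode('utf-8')
        try:
            parsed = json.loads(data) if data else None
        except json.JSONDecodeError:
            parsed = None  # Body is not JSON (e.g. an HTML error page)
        return {
            'status': response.status,
            'headers': dict(response.headers),
            'body': data,
            'json': parsed
        }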

def simple_json_parser(text):
    """Simple JSON parser using built-in library"""
    return json.loads(text)

def flatten_json(nested_dict, prefix=''):
    """Flatten nested JSON structure for DataFrame conversion"""
    items = []
    for key, value in nested_dict.items():
        new_key = prefix + '.' + key if prefix else key
        if isinstance(value, dict):
            items.extend(flatten_json(value, new_key).items())
        else:
            items.append((new_key, value))
    return dict(items)

# Example JSON handling
sample_json = '{"name": "Alice", "age": 30, "address": {"city": "NYC", "zip": "10001"}}'
parsed = simple_json_parser(sample_json)
flattened = flatten_json(parsed)
print(flattened)
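The flatten_json function above only descends into dictionaries, so a list value is stored as-is. API responses often contain arrays, so one possible extension is sketched below. The name flatten_with_lists and the choice of using the list index as a key segment are assumptions made for this example.

```python
import json

def flatten_with_lists(value, prefix=""):
    """Flatten dicts and lists; list items get their index as a key segment."""
    if isinstance(value, dict):
        pairs = value.items()
    elif isinstance(value, list):
        pairs = enumerate(value)
    else:
        return {prefix: value}  # leaf value: emit it under the accumulated key
    flat = {}
    for key, child in pairs:
        child_key = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten_with_lists(child, child_key))
    return flat  # note: empty dicts and lists produce no keys at all

sample = json.loads('{"name": "Alice", "tags": ["a", "b"], "orders": [{"id": 7}]}')
print(flatten_with_lists(sample))
# {'name': 'Alice', 'tags.0': 'a', 'tags.1': 'b', 'orders.0.id': 7}
```

Index-based keys work for short, fixed-length arrays. For arrays of records, one row per element is usually a better fit.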

Using Libraries (requests, pandas)

import random
import time

import pandas as pd
import requests

# Basic GET request
# response = requests.get('https://api.example.com/data')
# data = response.json()

# Request with headers and parameters
headers = {'Authorization': 'Bearer YOUR_API_TOKEN', 'Accept': 'application/json'}
params = {'page': 1, 'limit': 100, 'sort': 'created_at'}

# response = requests.get('https://api.example.com/items', headers=headers, params=params)

# POST request with JSON payload
payload = {'name': 'Alice', 'email': 'alice@example.com'}
# response = requests.post('https://api.example.com/users', json=payload, headers=headers)

def fetch_with_retry(url, max_retries=3, backoff=1.0):
    """Fetch URL, retrying 429 and 5xx responses with exponential backoff and jitter"""
    for attempt in range(max_retries + 1):
        response = None
        try:
            response = requests.get(url, timeout=30)
            if response.status_code != 429 and response.status_code < 500:
                return response  # Success, or a client error a retry won't fix
        except requests.RequestException:
            pass  # Network error or timeout: treat as transient
        if attempt < max_retries:
            # Wait = Base * 2^Attempt + Jitter
            wait = backoff * (2 ** attempt) + random.uniform(0, backoff)
            # Honor a Retry-After header given in seconds (HTTP-date form is ignored here)
            retry_after = response.headers.get('Retry-After') if response is not None else None
            if retry_after and retry_after.isdigit():
                wait = max(wait, int(retry_after))
            time.sleep(wait)
    return None  # All attempts failed

def fetch_all_pages(base_url, headers=None):
    """Fetch all paginated results (assumes each page is a JSON array)"""
    all_data = []
    page = 1
    while True:
        params = {'page': page, 'limit': 100}
        response = requests.get(base_url, headers=headers, params=params, timeout=30)
        response.raise_for_status()  # Fail loudly instead of silently returning partial data
        data = response.json()
        if not data:
            break  # An empty page marks the end of the results
        all_data.extend(data)
        page += 1
        time.sleep(0.1)  # Simple client-side throttle between pages
    return all_data

# Convert JSON to DataFrame
sample_data = [
    {'id': 1, 'name': 'Alice', 'age': 30, 'department': 'Engineering'},
    {'id': 2, 'name': 'Bob', 'age': 25, 'department': 'Sales'}
]
df = pd.DataFrame(sample_data)

# Handle nested JSON
nested_data = [
    {'id': 1, 'name': 'Product A', 'metadata': {'price': 99.99, 'category': 'Electronics'}},
    {'id': 2, 'name': 'Product B', 'metadata': {'price': 149.99, 'category': 'Electronics'}}
]
df_nested = pd.json_normalize(nested_data, sep='_')
print(df_nested)
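The fetch_all_pages function above uses page numbers. The cursor-based approach mentioned in the definition can be sketched as follows. The response shape (an "items" list plus a "next_cursor" field) and the max_pages safety cap are assumptions for this example; real APIs name these fields differently. The page-fetching function is passed in so the sketch runs offline against fake data.

```python
def fetch_all_cursor(fetch_page, max_pages=1000):
    """Collect items from a cursor-paginated source.

    fetch_page(cursor) is expected to return a dict shaped like
    {"items": [...], "next_cursor": "<token>" or None} (assumed field names).
    """
    items, cursor = [], None
    for _ in range(max_pages):  # hard cap guards against a cursor that never ends
        page = fetch_page(cursor)
        items.extend(page.get("items", []))
        cursor = page.get("next_cursor")
        if not cursor:
            break  # no cursor means this was the last page
    return items

# Stand-in for an HTTP call so the sketch runs without a network.
FAKE_PAGES = {
    None: {"items": [1, 2], "next_cursor": "c1"},
    "c1": {"items": [3, 4], "next_cursor": "c2"},
    "c2": {"items": [5], "next_cursor": None},
}

def fake_fetch(cursor):
    return FAKE_PAGES[cursor]

print(fetch_all_cursor(fake_fetch))  # [1, 2, 3, 4, 5]
```

With requests, fetch_page would typically wrap a requests.get call that sends the cursor as a query parameter and returns response.json(). The parameter name depends on the API.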

When to Use

✅ Appropriate Use Cases:

Fetching real-time or near real-time data from external services like weather APIs, stock prices, or social media feeds
Integrating with SaaS platforms (CRM, marketing automation, accounting) via their REST APIs for data synchronization
Building ETL pipelines that pull data from multiple microservices or external data providers
Ingesting data from IoT devices or sensors that expose data via HTTP endpoints
Populating data lakes or warehouses with external data sources that don't offer direct database access
Automating data collection from web services on a schedule (hourly, daily) for reporting and analytics

❌ Avoid When:

When you have direct database access - database connections are more efficient than API calls for bulk data
For real-time streaming with sub-second latency requirements - use WebSockets, gRPC, or message queues instead of REST polling
When API rate limits prevent meaningful data volumes - consider requesting bulk data exports or database dumps instead
For high-throughput or latency-sensitive internal service communication within a microservices architecture - message buses or gRPC often perform better than REST
When data changes very frequently and you need push notifications - use webhooks or WebSockets instead of polling
If the API requires complex authentication flows (OAuth2 with MFA) that are difficult to automate

Common Pitfalls

Not handling rate limits (429 errors) - implement exponential backoff and check Retry-After headers to respect API limits.
Hardcoding API keys in code - use environment variables or secret managers; never commit credentials to version control.
Not validating response data before parsing - APIs can return unexpected formats; always check status codes and content types.
Fetching all data in one request without pagination - implement pagination to handle large datasets and respect API limits.
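Not implementing retry logic for transient failures - network issues and temporary 5xx errors should be retried with backoff.
Opening a new connection for every request and never releasing it - reuse a requests.Session() for connection pooling, and use it as a context manager (with requests.Session() as session:) so connections are closed when you finish.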
Not handling timeouts - always set timeout parameters to prevent hanging requests from blocking your pipeline indefinitely.
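Two of the pitfalls above, hardcoded keys and unvalidated responses, can be guarded against with small helpers. This is a minimal sketch: the helper names and the environment variable name EXAMPLE_API_KEY are hypothetical, and the validation function takes plain values so it runs without a network.

```python
import json
import os

def load_api_key(var_name="EXAMPLE_API_KEY"):
    """Read the key from the environment instead of hardcoding it (variable name is hypothetical)."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable")
    return key

def parse_json_response(status_code, content_type, body):
    """Check status code and content type before parsing the body as JSON."""
    if status_code != 200:
        raise ValueError(f"Unexpected status {status_code}")
    if "application/json" not in (content_type or "").lower():
        raise ValueError(f"Expected JSON, got {content_type!r}")
    return json.loads(body)

print(parse_json_response(200, "application/json; charset=utf-8", '{"ok": true}'))
# {'ok': True}
```

With requests, the three arguments would come from response.status_code, response.headers.get('Content-Type'), and response.text.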