Adam Bandel


Generative Gateway

Jan 2026
Type: api
Code: 7k lines
Files: 49
Active: Nov 2025 — Jan 2026
Stack:
Python · FastAPI · TypeScript · Next.js · SQLite · OpenTelemetry
Tags:
ai · developer-tools · api

Overview

Generative Gateway is a provider-agnostic HTTP gateway that sits between client applications and multiple LLM providers (currently OpenRouter). It provides a unified API for chat completions, embeddings, and multi-modal interactions while adding enterprise-grade features like per-request cost tracking, session persistence, real-time streaming, and comprehensive observability.

The gateway abstracts away the complexity of different provider APIs, allowing applications to switch models or providers without code changes. It tracks every request with token counts, latency, and USD cost estimates, enabling teams to monitor and control their AI spending.

Screenshots

Chat Interface · Model Catalog · Usage Dashboard (images not included)

Problem

Modern AI applications often need to:

- call several models or providers through one consistent API
- track token usage, latency, and spend on every request
- stream responses to users in real time
- persist conversation sessions and observe production traffic

Building these features from scratch for each application is time-consuming and error-prone. Existing solutions either lock you into a single provider or lack the observability needed for production workloads.

Approach

The gateway implements a layered architecture that cleanly separates concerns while maintaining flexibility.
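
One way to picture the provider-abstraction layer is an adapter interface that the routing code programs against. This is an illustrative sketch, not the project's actual classes; `ProviderAdapter`, `OpenRouterAdapter`, and `ChatResult` are assumed names:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class ChatResult:
    text: str
    input_tokens: int
    output_tokens: int

class ProviderAdapter(ABC):
    """Provider-agnostic interface the gateway's routing layer depends on."""
    @abstractmethod
    def chat(self, model: str, messages: list[dict]) -> ChatResult: ...

class OpenRouterAdapter(ProviderAdapter):
    def chat(self, model: str, messages: list[dict]) -> ChatResult:
        # A real implementation would call the OpenRouter HTTP API here.
        raise NotImplementedError

def complete(adapter: ProviderAdapter, model: str, messages: list[dict]) -> ChatResult:
    # Callers see only the interface, so swapping providers (or adding a
    # new adapter) requires no changes in application code.
    return adapter.chat(model, messages)
```

Because `complete` only depends on the abstract interface, switching models or providers is a configuration change rather than a code change.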

Stack

Python with FastAPI on the backend, a TypeScript/Next.js frontend, SQLite for persistence, and OpenTelemetry for tracing and metrics.

Challenges

Outcomes

The gateway successfully provides:

- a single, provider-agnostic API for chat completions, embeddings, and multi-modal requests
- per-request tracking of token counts, latency, and estimated USD cost
- real-time streaming over Server-Sent Events with typed event payloads
- session persistence and OpenTelemetry-based observability

Key learning: The adapter pattern works well for provider abstraction, but the real complexity lives in normalizing response formats. Provider APIs return reasoning, tool calls, and multi-modal content in wildly different structures.
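
To make the normalization problem concrete, here is a hedged sketch (the provider names and raw shapes are simplified stand-ins, not the gateway's real code) of mapping two differently-shaped tool-call payloads into one canonical structure:

```python
def normalize_tool_calls(provider: str, raw: dict) -> list[dict]:
    """Map provider-specific tool-call shapes into one canonical form.

    Illustrative only: "openai-style" nests calls under a `function` key,
    while "anthropic-style" embeds `tool_use` blocks in a content list.
    """
    calls: list[dict] = []
    if provider == "openai-style":
        for tc in raw.get("tool_calls", []):
            calls.append({"name": tc["function"]["name"],
                          "arguments": tc["function"]["arguments"]})
    elif provider == "anthropic-style":
        for block in raw.get("content", []):
            if block.get("type") == "tool_use":
                calls.append({"name": block["name"],
                              "arguments": block["input"]})
    return calls
```

The branching per provider is exactly the complexity the adapter layer has to absorb so that clients see one schema.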

Implementation Notes

Streaming Architecture

The gateway uses Server-Sent Events with typed event payloads:

# Event types for structured streaming
class SSEEvent:
    META = "meta"        # Session ID, model info
    TOKEN = "token"      # Text delta
    REASONING = "reasoning"  # Thinking output
    TOOL_CALLS = "tool_calls"  # Function calls
    IMAGES = "images"    # Generated image data
    DONE = "done"        # Final usage stats
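
On the wire, each typed event can be framed per the SSE format's `event:`/`data:` fields. A minimal sketch (the `format_sse` helper is a hypothetical name, not the gateway's actual function):

```python
import json

def format_sse(event_type: str, data: dict) -> str:
    """Frame one typed SSE event: an `event:` line, a JSON `data:` line,
    and the blank line that terminates the event."""
    return f"event: {event_type}\ndata: {json.dumps(data)}\n\n"
```

A streaming response would yield `format_sse(SSEEvent.META, ...)` first, then a `token` event per text delta, and finish with a `done` event carrying usage stats.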

Model Resolution

Fuzzy matching with rapidfuzz allows users to reference models by partial names:

# User can request "claude-3.5" instead of full ID
model = resolve_model("claude-3.5")
# Returns: "anthropic/claude-3.5-sonnet"

Cost Calculation

Every request calculates cost using cached pricing:

cost_usd = (
    (input_tokens / 1_000_000) * pricing.input_per_million +
    (output_tokens / 1_000_000) * pricing.output_per_million +
    (web_searches * 0.01)  # If web search enabled
)

Rate Limiting

Token bucket algorithm with per-project-per-minute buckets:

@dataclass
class TokenBucket:
    tokens: float
    last_refill: float
    capacity: int
    refill_rate: float  # tokens per second
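
The refill-and-consume step for that dataclass can be sketched as follows; `try_consume` is a hypothetical method name, and timestamps are passed in explicitly here so the logic is deterministic (a real implementation would use `time.monotonic()`):

```python
from dataclasses import dataclass

@dataclass
class TokenBucket:
    tokens: float
    last_refill: float
    capacity: int
    refill_rate: float  # tokens per second

    def try_consume(self, now: float, cost: float = 1.0) -> bool:
        """Refill proportionally to elapsed time (capped at capacity),
        then spend `cost` tokens if available."""
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Keeping one bucket per project per minute means a burst from one project exhausts only its own bucket, leaving other projects unaffected.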
