Generative Gateway
Jan 2026
Overview
Generative Gateway is a provider-agnostic HTTP gateway that sits between client applications and LLM providers (currently OpenRouter is the only supported backend). It exposes a unified API for chat completions, embeddings, and multi-modal interactions, and adds per-request cost tracking, session persistence, real-time streaming, and observability through metrics, traces, and structured logs.
The gateway abstracts away the complexity of different provider APIs, allowing applications to switch models or providers without code changes. It tracks every request with token counts, latency, and USD cost estimates, enabling teams to monitor and control their AI spending.
Problem
Modern AI applications often need to:
- Use multiple LLM providers or switch between them based on cost, performance, or availability
- Track usage and costs at a granular level for billing and budgeting
- Maintain conversation history across sessions
- Handle streaming responses, tool calling, and multi-modal inputs (images, PDFs, audio)
- Support “thinking” models with extended reasoning capabilities
Building these features from scratch for each application is time-consuming and error-prone. Existing solutions either lock you into a single provider or lack the observability needed for production workloads.
Approach
The gateway implements a layered architecture that cleanly separates concerns while maintaining flexibility.
Stack
- FastAPI - Async Python framework chosen for its native async support, automatic OpenAPI documentation, and Pydantic integration for request validation
- Next.js 16 + React 19 - Modern frontend stack with React Query for data fetching and server components for optimal performance
- SQLite - Embedded database for zero-config deployment; schema supports easy migration to PostgreSQL for scaling
- OpenTelemetry - Distributed tracing and metrics collection, exportable to any observability backend
- httpx - Async HTTP client for provider requests with streaming support
- Server-Sent Events - Real-time streaming of token deltas, reasoning output, and generated images
Challenges
- Unified Thinking Interface - Different providers implement extended reasoning differently (DeepSeek uses a `:thinking` suffix, Claude uses `budget_tokens`, Gemini uses `effort` levels). Solved by mapping a single `thinking` configuration to provider-specific implementations at the adapter layer.
- Multi-Modal Message Handling - Supporting text, images (base64/URL), PDFs, and audio in a single message format required careful schema design. Implemented a `ContentPart` union type that validates content based on message type and model capabilities.
- Session Persistence with Images - Storing generated images in conversation history for context continuity. Images are persisted as base64 in the session store and properly decoded when loading history for subsequent requests.
- Real-Time Cost Tracking - Computing accurate USD costs requires knowing per-model token pricing, which changes frequently. Implemented a pricing refresh system that pulls current rates from OpenRouter and caches them locally, with manual refresh endpoint for immediate updates.
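The unified thinking interface described in the first challenge can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual adapter code: the function name `apply_thinking`, the shape of the `thinking` dict, the payload field names, and the example model IDs are all assumptions made for the example. Only the three provider behaviors (suffix, `budget_tokens`, `effort`) come from the description above.

```python
from typing import Any, Dict, Optional


def apply_thinking(
    model: str,
    payload: Dict[str, Any],
    thinking: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Translate one generic `thinking` config into provider-specific fields.

    All field names here are hypothetical; a real adapter would follow
    each provider's documented request schema.
    """
    out = dict(payload, model=model)  # copy so the caller's payload is untouched
    if not thinking or not thinking.get("enabled"):
        return out

    provider = model.split("/", 1)[0]
    if provider == "deepseek":
        # Reasoning is selected through a model-name suffix.
        if not model.endswith(":thinking"):
            out["model"] = model + ":thinking"
    elif provider == "anthropic":
        # Reasoning is bounded by a token budget (default is an assumption).
        out["thinking"] = {"budget_tokens": thinking.get("budget_tokens", 2048)}
    elif provider == "google":
        # Reasoning is controlled by a coarse effort level (default is an assumption).
        out["reasoning"] = {"effort": thinking.get("effort", "medium")}
    return out


# Usage with illustrative model IDs:
cfg = {"enabled": True, "budget_tokens": 4096, "effort": "high"}
print(apply_thinking("deepseek/example-model", {"messages": []}, cfg)["model"])
print(apply_thinking("anthropic/example-model", {"messages": []}, cfg)["thinking"])
```

Keeping the translation in one function per adapter means client code only ever sees the generic `thinking` object, which is the point of the abstraction.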
Outcomes
The gateway provides:
- Provider Abstraction - Single API works across 100+ models from different providers
- Cost Visibility - Every request logged with input/output tokens and USD cost
- Session Management - Automatic conversation history with lazy loading
- Resilience - Retry with exponential backoff, hedged requests for latency-sensitive calls
- Observability - Prometheus metrics, OpenTelemetry traces, structured JSON logging
Key learning: The adapter pattern works well for provider abstraction, but the real complexity lives in normalizing response formats. Provider APIs return reasoning, tool calls, and multi-modal content in wildly different structures.
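To make the normalization problem concrete, here is a small sketch that folds two differently shaped responses into one internal structure. Both raw shapes are invented for illustration and do not reproduce any real provider's schema; `NormalizedResponse` and `normalize` are likewise hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class NormalizedResponse:
    text: str = ""
    reasoning: Optional[str] = None
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)


def normalize(raw: Dict[str, Any]) -> NormalizedResponse:
    """Map a raw provider response onto the gateway's internal shape."""
    # Hypothetical shape A: one flat message with optional fields.
    if "message" in raw:
        msg = raw["message"]
        return NormalizedResponse(
            text=msg.get("content") or "",
            reasoning=msg.get("reasoning"),
            tool_calls=list(msg.get("tool_calls") or []),
        )

    # Hypothetical shape B: a list of typed content blocks.
    if "blocks" in raw:
        text_parts: List[str] = []
        reasoning_parts: List[str] = []
        calls: List[Dict[str, Any]] = []
        for block in raw["blocks"]:
            kind = block.get("type")
            if kind == "text":
                text_parts.append(block["text"])
            elif kind == "thinking":
                reasoning_parts.append(block["text"])
            elif kind == "tool_use":
                calls.append({"name": block["name"], "arguments": block.get("input", {})})
        return NormalizedResponse(
            text="".join(text_parts),
            reasoning="".join(reasoning_parts) or None,
            tool_calls=calls,
        )

    raise ValueError("unrecognized response shape")
```

Once every adapter returns the same structure, streaming, persistence, and cost tracking can be written once against it rather than once per provider.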
Implementation Notes
Streaming Architecture
The gateway uses Server-Sent Events with typed event payloads:
# Event types for structured streaming
class SSEEvent:
    META = "meta"              # Session ID, model info
    TOKEN = "token"            # Text delta
    REASONING = "reasoning"    # Thinking output
    TOOL_CALLS = "tool_calls"  # Function calls
    IMAGES = "images"          # Generated image data
    DONE = "done"              # Final usage stats
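On the wire, each typed event would be serialized as a Server-Sent Events frame: an `event:` line, a `data:` line, and a blank line as terminator. The helper below is a sketch of that framing only. The name `format_sse` and the JSON payload fields are assumptions, not the gateway's actual code.

```python
import json
from typing import Any, Dict


def format_sse(event: str, data: Dict[str, Any]) -> str:
    """Serialize one event as an SSE frame.

    json.dumps escapes newlines inside strings, so the payload always
    fits on a single `data:` line.
    """
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


# Hypothetical payloads for a short streamed reply:
frames = [
    format_sse("meta", {"session_id": "abc123"}),
    format_sse("token", {"delta": "Hel"}),
    format_sse("token", {"delta": "lo"}),
    format_sse("done", {"input_tokens": 12, "output_tokens": 2}),
]
print("".join(frames), end="")
```

Naming the event type lets a browser `EventSource` client attach a separate listener per type instead of parsing a discriminator field out of every message.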
Model Resolution
Fuzzy matching with rapidfuzz allows users to reference models by partial names:
# User can request "claude-3.5" instead of full ID
model = resolve_model("claude-3.5")
# Returns: "anthropic/claude-3.5-sonnet"
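A possible shape for `resolve_model` is sketched below. To keep the example dependency-free it uses the standard library's `difflib.SequenceMatcher` as a stand-in for rapidfuzz, so scores will differ from what the gateway computes. The candidate list, the 0.5 threshold, and the choice to also score the name without its provider prefix are assumptions.

```python
from difflib import SequenceMatcher
from typing import Sequence

# Illustrative catalogue; the real gateway would use its cached model list.
KNOWN_MODELS = [
    "anthropic/claude-3.5-sonnet",
    "anthropic/claude-3-opus",
    "openai/gpt-4o",
]


def _score(query: str, model_id: str) -> float:
    """Best similarity against the full ID or the part after the provider prefix."""
    short_name = model_id.split("/", 1)[-1]
    return max(
        SequenceMatcher(None, query, model_id).ratio(),
        SequenceMatcher(None, query, short_name).ratio(),
    )


def resolve_model(
    query: str,
    candidates: Sequence[str] = KNOWN_MODELS,
    min_score: float = 0.5,
) -> str:
    """Return the closest known model ID, or raise if nothing is close enough."""
    q = query.strip().lower()
    best = max(candidates, key=lambda model_id: _score(q, model_id))
    if _score(q, best) < min_score:
        raise ValueError(f"no model matches {query!r}")
    return best


print(resolve_model("claude-3.5"))
```

Rejecting low-scoring matches matters here: silently routing a typo to an unrelated, possibly more expensive model would undermine the cost tracking the gateway exists to provide.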
Cost Calculation
Every request calculates cost using cached pricing:
cost_usd = (
    (input_tokens / 1_000_000) * pricing.input_per_million +
    (output_tokens / 1_000_000) * pricing.output_per_million +
    (web_searches * 0.01)  # If web search enabled
)
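Wrapped in a function, the same formula can be checked with a worked example. The `ModelPricing` dataclass, the function name, and the prices below are illustrative assumptions; the 0.01 USD per web search is taken from the snippet above and exposed as a parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    pricing: ModelPricing,
    web_searches: int = 0,
    web_search_usd: float = 0.01,
) -> float:
    """Estimate the USD cost of one request from token counts and cached pricing."""
    return (
        (input_tokens / 1_000_000) * pricing.input_per_million
        + (output_tokens / 1_000_000) * pricing.output_per_million
        + web_searches * web_search_usd
    )


# Illustrative prices, not real rates.
pricing = ModelPricing(input_per_million=3.0, output_per_million=15.0)
cost = estimate_cost_usd(input_tokens=1_000, output_tokens=500, pricing=pricing)
print(f"{cost:.4f} USD")
```

With these numbers the input side contributes 0.003 USD and the output side 0.0075 USD, for roughly 0.0105 USD in total.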
Rate Limiting
Token bucket algorithm with per-project-per-minute buckets:
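from dataclasses import dataclass

@dataclass
class TokenBucket:
    tokens: float        # tokens currently available
    last_refill: float   # timestamp of the last refill
    capacity: int        # maximum tokens the bucket can hold
    refill_rate: float   # tokens per second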
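The dataclass above only holds state. A sketch of the refill-and-consume step is shown below, assuming a monotonic clock and one request costing one token. The `try_consume` method and the injectable `now` argument (useful for tests) are additions for illustration, not part of the original code.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenBucket:
    tokens: float
    last_refill: float
    capacity: int
    refill_rate: float  # tokens per second

    def try_consume(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Refill based on elapsed time, then take `cost` tokens if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last_refill)
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(float(self.capacity), self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A limit of 60 requests per minute maps to a refill rate of 1 token per second.
bucket = TokenBucket(tokens=60.0, last_refill=time.monotonic(), capacity=60, refill_rate=60 / 60)
print(bucket.try_consume())  # the bucket starts full, so the first request is allowed
```

For per-project limits, one such bucket would be kept per project key, for example in a dict keyed by project ID.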