Adam Bandel


LLM Benchmark Aggregator

Dec 2025
Type: data-pipeline
Code: 23k lines
Files: 128
Active: Dec 2025 — Dec 2025
Stack:
Python, FastAPI, SQLAlchemy, React, TypeScript, SQLite
Tags:
ai, data, developer-tools

Overview

LLM Benchmark Aggregator is a full-stack service that collects, normalizes, and visualizes AI model performance data from over 50 disparate benchmark sources. It solves the fragmentation problem in the AI evaluation space—where each leaderboard uses different model naming conventions, scoring formats, and update frequencies—by creating a unified view with canonical model identities.

The system combines web scraping, API integrations, HuggingFace datasets, and CSV imports through 18 specialized adapters, all feeding into a normalized SQLite database with historical tracking capabilities.

Screenshots

Dashboard Overview

Benchmark Leaderboard

Model Hierarchy

Problem

AI benchmark results are scattered across dozens of platforms: LiveBench, LMArena, HuggingFace Open LLM Leaderboard, Chatbot Arena, ARC-AGI, and many more. Each source uses its own model naming conventions, scoring formats, and update frequencies.

Researchers and developers who want a holistic view of model performance must manually visit multiple sites and mentally reconcile the different naming schemes.
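To illustrate the reconciliation problem, here is how one model can appear under incompatible names across sources, and how a canonical alias map pulls the rows back together. All model names, scores, and mapping entries below are invented examples, not data from the project:

```python
# Hypothetical rows: the same model under three different names, so a
# naive join across sources would treat them as three separate models.
SOURCE_ROWS = [
    {"source": "livebench", "model": "GPT-4o (2024-08-06)", "score": 61.2},
    {"source": "lmarena", "model": "gpt-4o-2024-08-06", "score": 1337},
    {"source": "epoch_csv", "model": "openai/gpt-4o", "score": 0.88},
]

# A canonical alias map (entries are illustrative).
ALIASES = {
    "GPT-4o (2024-08-06)": "openai/gpt-4o-2024-08-06",
    "gpt-4o-2024-08-06": "openai/gpt-4o-2024-08-06",
    "openai/gpt-4o": "openai/gpt-4o-2024-08-06",
}

def canonicalize(rows):
    """Group source-specific rows under one canonical model identity."""
    by_model = {}
    for row in rows:
        canonical = ALIASES.get(row["model"], row["model"])
        by_model.setdefault(canonical, []).append(row)
    return by_model
```

With the alias map in place, all three rows land under a single canonical key instead of three.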

Approach

The aggregator treats benchmark collection as an ETL pipeline with an intelligent identity layer.

Stack

Python with FastAPI and SQLAlchemy on the backend, SQLite for storage, and a React + TypeScript frontend.

Challenges

Outcomes

The system successfully aggregates benchmarks from 50+ sources into a queryable, comparable format with historical tracking.

The adapter pattern proved highly extensible—adding a new benchmark source requires only implementing a single parse() method.

Implementation Notes

The adapter registry enables dynamic source handling:

ADAPTER_REGISTRY = {
    "livebench": LiveBenchAdapter,
    "lmarena": LMArenaAdapter,
    "artificial_analysis": ArtificialAnalysisAdapter,
    "epoch_csv": EpochCSVAdapter,
    # ... 14 more adapters
}
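A registry like this is typically consumed by a small dispatch helper that instantiates the right adapter for a source key. The stub classes below stand in for the project's real adapters; only the registry shape matches the snippet above:

```python
class BaseAdapter:
    """Minimal stand-in for the project's adapter base class."""
    async def parse(self, content):
        raise NotImplementedError

class LiveBenchAdapter(BaseAdapter):
    pass

# Stub registry mirroring the shape of ADAPTER_REGISTRY above.
ADAPTER_REGISTRY = {"livebench": LiveBenchAdapter}

def get_adapter(source: str) -> BaseAdapter:
    """Dispatch dynamically; unknown sources fail with a clear error."""
    try:
        return ADAPTER_REGISTRY[source]()
    except KeyError:
        raise ValueError(f"no adapter registered for {source!r}") from None
```

Keeping construction behind one helper means new sources only need a registry entry, and typos in source names surface as explicit errors rather than silent misses.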

Each adapter inherits from BaseAdapter and implements source-specific parsing:

class LiveBenchAdapter(BaseAdapter):
    async def parse(self, content: str) -> list[BenchmarkResult]:
        # Extract markdown tables, normalize scores, map model names
        ...
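A hedged sketch of what the base class and result type might look like, with a toy CSV adapter standing in for a real source. The field names and CSV format are assumptions, not the project's actual schema:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    """Normalized row shape; field names are illustrative."""
    model_name: str
    benchmark: str
    score: float

class BaseAdapter:
    source: str = "base"

    async def parse(self, content: str) -> list[BenchmarkResult]:
        raise NotImplementedError

class ExampleCSVAdapter(BaseAdapter):
    """Toy adapter: parses 'model,benchmark,score' lines."""
    source = "example_csv"

    async def parse(self, content: str) -> list[BenchmarkResult]:
        results = []
        for line in content.strip().splitlines():
            model, benchmark, score = line.split(",")
            results.append(BenchmarkResult(model, benchmark, float(score)))
        return results
```

Because every adapter returns the same `BenchmarkResult` shape, the downstream pipeline never needs to know which source a row came from.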

The canonical model hierarchy is stored across related tables:

eras (technological generations)
  └── model_families (grouped by provider + release)
        └── canonical_models (authoritative identities)
              └── model_variants (source-specific aliases)
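That hierarchy could be expressed as SQLite DDL roughly like the following. The table names come from the diagram above; the column names and the `resolve_alias` helper are assumptions added to show the payoff, mapping a source-specific alias back to its canonical identity:

```python
import sqlite3

# Hypothetical DDL mirroring the four-level hierarchy above.
SCHEMA = """
CREATE TABLE eras (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE model_families (
    id INTEGER PRIMARY KEY,
    era_id INTEGER REFERENCES eras(id),
    provider TEXT NOT NULL,
    name TEXT NOT NULL
);
CREATE TABLE canonical_models (
    id INTEGER PRIMARY KEY,
    family_id INTEGER REFERENCES model_families(id),
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE model_variants (
    id INTEGER PRIMARY KEY,
    model_id INTEGER REFERENCES canonical_models(id),
    source TEXT NOT NULL,
    alias TEXT NOT NULL
);
"""

def resolve_alias(conn: sqlite3.Connection, source: str, alias: str):
    """Map a source-specific alias to its canonical model name, if known."""
    row = conn.execute(
        """SELECT cm.name FROM model_variants mv
           JOIN canonical_models cm ON cm.id = mv.model_id
           WHERE mv.source = ? AND mv.alias = ?""",
        (source, alias),
    ).fetchone()
    return row[0] if row else None
```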

The 7-step LLM pipeline handles:

  1. Provider detection
  2. Family grouping
  3. Era assignment
  4. Modality classification
  5. Release date discovery
  6. Duplicate detection
  7. Confidence scoring
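The steps above can be sketched as a chain of small enrichment functions, each annotating a record and adjusting a confidence score. Only the first two steps are shown, and the heuristics are invented stand-ins for the real LLM calls:

```python
# Illustrative provider list; the real pipeline infers this via an LLM.
KNOWN_PROVIDERS = ("openai", "anthropic", "google", "meta")

def detect_provider(record):
    """Step 1: provider detection (toy substring heuristic)."""
    name = record["raw_name"].lower()
    for provider in KNOWN_PROVIDERS:
        if provider in name:
            record["provider"] = provider
            return record
    record["provider"] = "unknown"
    record["confidence"] *= 0.5  # penalize uncertain detections
    return record

def assign_family(record):
    """Step 2: family grouping (toy rule: provider + first name token)."""
    record["family"] = f"{record['provider']}/{record['raw_name'].split('-')[0]}"
    return record

PIPELINE = [detect_provider, assign_family]  # real pipeline has 7 steps

def enrich(raw_name: str) -> dict:
    record = {"raw_name": raw_name, "confidence": 1.0}
    for step in PIPELINE:
        record = step(record)
    return record
```

Structuring each step as a pure function over the record makes it easy to reorder steps, test them in isolation, and carry a running confidence score through the whole chain.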
