Music Analyzer

Jan 2026

Type: desktop

Code: 9k lines

Files: 27

Active: Jan 2026 — Jan 2026

Stack:

Tags:

Overview

Music Analyzer is a sophisticated real-time audio analysis system that captures system audio and extracts comprehensive musical features using a combination of AI models and digital signal processing. It generates “Music Intelligence Profiles” optimized for AI music generation systems like Suno v5.

The system runs as a FastAPI web server with WebSocket streaming, providing live visualization of audio analysis at 10Hz refresh rate. It combines multiple specialized analyzers—from neural network classifiers to traditional DSP algorithms—and uses modal fusion to cross-validate signals between models for improved accuracy.

Screenshots

Real-time Analysis Dashboard

AI Prompt Export Modal

Temporal Analysis Charts

Problem

Creating effective prompts for AI music generation requires detailed understanding of a song’s musical characteristics—genre, instrumentation, tempo, key, production style, and emotional arc. Manually describing these attributes is tedious and often inaccurate.

Existing audio analysis tools either focus on single aspects (just beat detection, just genre classification) or require offline processing. There was no unified system that could analyze music in real-time across multiple dimensions and output prompts specifically optimized for AI music generators.

Approach

Built a multi-model analysis pipeline that combines the strengths of different approaches: fixed-taxonomy classification (AST), zero-shot classification (CLAP), neural pitch detection (CREPE), source separation (Demucs), and traditional DSP algorithms (librosa).

Stack

Audio Capture - PyAudioWPatch with WASAPI loopback for zero-config system audio capture on Windows
Neural Models - PyTorch with CUDA for GPU-accelerated inference (AST, CLAP, CREPE, emotion2vec, Demucs)
DSP Analysis - librosa for spectral features, beat tracking, chord detection, and key estimation
Web Server - FastAPI with WebSocket streaming for real-time browser visualization
Frontend - Vanilla JS with Chart.js for live metrics and temporal charts
LLM Integration - httpx async client connecting to Generative Gateway for enhanced prompt generation

Challenges

Model hallucinations - Individual models often misclassify audio. Solved with modal fusion: cross-validating signals between AST, CLAP, and Demucs. If AST detects “singing” but Demucs shows no vocal stem energy, the vocal detection is suppressed.
VRAM pressure - Running 5+ neural models simultaneously exhausts GPU memory. Implemented lazy loading where models only load when explicitly enabled, plus per-model toggles in the UI.
Real-time performance - Needed consistent 10Hz analysis with GPU models. Precomputed CLAP text embeddings (eliminating encoder passes in hot loop) and batched multi-category inference achieve 10-50x speedup.
Song structure detection - Detecting verse/chorus/bridge boundaries without beat grid information. Built multi-feature change detection tracking energy, brightness, warmth, and vocal presence with adaptive thresholding and fallback auto-splitting.

Outcomes

The system successfully generates detailed Suno v5 prompts that capture a song’s evolving characteristics—not just static tags, but per-section descriptions of dynamics, texture, production techniques, and instrumentation. The modal fusion approach measurably reduces model hallucinations compared to single-model classification.

Key learnings:

Zero-shot models (CLAP) complement fixed-taxonomy models (AST) well—each catches what the other misses
Precomputing static embeddings is essential for real-time neural network inference
Cross-model validation is more robust than confidence thresholds for filtering false positives
Frame-based session recording enables clean separation between real-time and offline processing

Implementation Notes

Multi-Model Architecture

┌──────────────────────────────────────────────────────┐
│         PARALLEL ANALYSIS MODULES (GPU)              │
├──────────────────────────────────────────────────────┤
│ AST (527 AudioSet classes) → Fixed taxonomy         │
│ CLAP (306 zero-shot labels) → Flexible detection    │
│ CREPE → Monophonic pitch                            │
│ Demucs → Vocals/drums/bass/other stems              │
│ librosa → Beat, chord, key, spectral features       │
│ emotion2vec → Vocal emotion (display only)          │
└──────────────────────────────────────────────────────┘
            ↓
┌──────────────────────────────────────────────────────┐
│              MODAL FUSION LAYER                       │
│  Cross-validate: AST + CLAP + Demucs signals         │
│  Gate unreliable detections, boost confirmed ones    │
└──────────────────────────────────────────────────────┘

CLAP Multi-Category Batching

Rather than running CLAP 6 times (once per category), text embeddings for all 306 labels are precomputed once at startup:

# Precompute embeddings for all labels
self.label_embeddings = {}
for category, labels in self.categories.items():
    with torch.no_grad():
        text_inputs = self.processor(text=labels, return_tensors="pt", padding=True)
        self.label_embeddings[category] = self.model.get_text_features(**text_inputs)

# In hot loop: only run audio encoder, then dot product with cached text embeddings
audio_features = self.model.get_audio_features(**audio_inputs)
for category, text_emb in self.label_embeddings.items():
    similarity = (audio_features @ text_emb.T).softmax(dim=-1)

Session Frame Recording

Analysis runs at 10Hz, but session frames are recorded at 1Hz for manageable data:

@dataclass
class SessionFrame:
    timestamp: float
    spectral: SpectralFeatures
    structure: MusicStructure
    production: ProductionMetrics
    stem_levels: dict[str, float]
    clap_results: dict[str, list[tuple[str, float]]]
    ast_predictions: list[tuple[str, float]]

Aggregation happens on session stop, with section detection using weighted multi-feature change scoring:

Energy: 40%
Brightness: 30%
Vocal presence: 30%

No posts yet.

Adam Bandel