Adam Bandel


Music Analyzer

Jan 2026
Type: desktop
Code: 9k lines
Files: 27
Active: Jan 2026 — Jan 2026
Stack:
Python, PyTorch, FastAPI, WebSocket, Transformers, librosa, Demucs, Chart.js
Tags:
ai, audio, music, developer-tools

Overview

Music Analyzer is a real-time audio analysis system that captures system audio and extracts comprehensive musical features using a combination of AI models and digital signal processing. It generates “Music Intelligence Profiles” optimized for AI music generation systems such as Suno v5.

The system runs as a FastAPI web server with WebSocket streaming, providing live visualization of audio analysis at a 10 Hz refresh rate. It combines multiple specialized analyzers—from neural network classifiers to traditional DSP algorithms—and uses modal fusion to cross-validate signals between models for improved accuracy.

Screenshots

Real-time Analysis Dashboard

AI Prompt Export Modal

Temporal Analysis Charts

Problem

Creating effective prompts for AI music generation requires detailed understanding of a song’s musical characteristics—genre, instrumentation, tempo, key, production style, and emotional arc. Manually describing these attributes is tedious and often inaccurate.

Existing audio analysis tools either focus on single aspects (just beat detection, just genre classification) or require offline processing. There was no unified system that could analyze music in real time across multiple dimensions and output prompts specifically optimized for AI music generators.

Approach

Built a multi-model analysis pipeline that combines the strengths of different approaches: fixed-taxonomy classification (AST), zero-shot classification (CLAP), neural pitch detection (CREPE), source separation (Demucs), and traditional DSP algorithms (librosa).
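A minimal sketch of how such a pipeline can fan one audio window out to several analyzers and merge their outputs; the analyzer stubs and the `analyze_window` name are illustrative assumptions, not the project's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real model wrappers (AST, CLAP, CREPE,
# Demucs, librosa); each returns a partial feature dict for one window.
def ast_classify(audio):   return {"ast_top": "Electric guitar"}
def clap_classify(audio):  return {"clap_genre": "indie rock"}
def dsp_features(audio):   return {"tempo_bpm": 118.0}

ANALYZERS = [ast_classify, clap_classify, dsp_features]

def analyze_window(audio) -> dict:
    """Run every analyzer on the same audio window and merge the results."""
    merged: dict = {}
    with ThreadPoolExecutor(max_workers=len(ANALYZERS)) as pool:
        for result in pool.map(lambda fn: fn(audio), ANALYZERS):
            merged.update(result)
    return merged
```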

Stack

Challenges

Outcomes

The system successfully generates detailed Suno v5 prompts that capture a song’s evolving characteristics—not just static tags, but per-section descriptions of dynamics, texture, production techniques, and instrumentation. The modal fusion approach measurably reduces model hallucinations compared to single-model classification.

Key learnings:

Implementation Notes

Multi-Model Architecture

┌──────────────────────────────────────────────────────┐
│         PARALLEL ANALYSIS MODULES (GPU)              │
├──────────────────────────────────────────────────────┤
│ AST (527 AudioSet classes) → Fixed taxonomy         │
│ CLAP (306 zero-shot labels) → Flexible detection    │
│ CREPE → Monophonic pitch                            │
│ Demucs → Vocals/drums/bass/other stems              │
│ librosa → Beat, chord, key, spectral features       │
│ emotion2vec → Vocal emotion (display only)          │
└──────────────────────────────────────────────────────┘
            ↓
┌──────────────────────────────────────────────────────┐
│              MODAL FUSION LAYER                       │
│  Cross-validate: AST + CLAP + Demucs signals         │
│  Gate unreliable detections, boost confirmed ones    │
└──────────────────────────────────────────────────────┘
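The gate-and-boost step of the fusion layer can be illustrated roughly as follows; the thresholds, multipliers, and label names here are assumptions made for the sketch, not the project's actual fusion rules:

```python
def fuse(ast_conf: dict, clap_conf: dict, stem_levels: dict) -> dict:
    """Illustrative fusion rule: gate label confidences that the Demucs
    stems contradict, and boost labels two independent models agree on.
    All thresholds and factors below are assumed values."""
    fused = {}
    for label, conf in clap_conf.items():
        # Gate: a vocal label is unreliable if the vocal stem is near-silent.
        if "vocal" in label and stem_levels.get("vocals", 0.0) < 0.05:
            conf *= 0.2
        # Boost: agreement between AST and CLAP raises confidence.
        if ast_conf.get(label, 0.0) > 0.5:
            conf = min(1.0, conf * 1.5)
        fused[label] = conf
    return fused
```

For example, a vocals label from CLAP is attenuated when Demucs reports a near-silent vocal stem, and reinforced when AST independently predicts the same label.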

CLAP Multi-Category Batching

Rather than running CLAP 6 times (once per category), text embeddings for all 306 labels are precomputed once at startup:

# Precompute text embeddings for all labels once at startup
self.label_embeddings = {}
with torch.no_grad():
    for category, labels in self.categories.items():
        text_inputs = self.processor(text=labels, return_tensors="pt", padding=True)
        self.label_embeddings[category] = self.model.get_text_features(**text_inputs)

# In the hot loop: run only the audio encoder, then take the dot product
# against the cached text embeddings for each category
with torch.no_grad():
    audio_features = self.model.get_audio_features(**audio_inputs)
for category, text_emb in self.label_embeddings.items():
    similarity = (audio_features @ text_emb.T).softmax(dim=-1)

Session Frame Recording

Analysis runs at 10 Hz, but session frames are recorded at 1 Hz to keep session data manageable:

from dataclasses import dataclass

@dataclass
class SessionFrame:
    timestamp: float
    spectral: SpectralFeatures
    structure: MusicStructure
    production: ProductionMetrics
    stem_levels: dict[str, float]
    clap_results: dict[str, list[tuple[str, float]]]
    ast_predictions: list[tuple[str, float]]

Aggregation happens on session stop, with section detection using weighted multi-feature change scoring.
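A rough sketch of what weighted multi-feature change scoring can look like on the 1 Hz frames; the feature names, weights, and threshold are assumptions for illustration, not the project's actual values:

```python
# Assumed feature weights and boundary threshold for the sketch.
WEIGHTS = {"rms": 0.4, "spectral_centroid": 0.3, "vocals_level": 0.3}
THRESHOLD = 0.5

def change_score(prev: dict, curr: dict) -> float:
    """Weighted sum of absolute per-feature changes between two frames."""
    return sum(w * abs(curr[k] - prev[k]) for k, w in WEIGHTS.items())

def section_boundaries(frames: list[dict]) -> list[int]:
    """Frame indices (seconds, at 1 Hz) where a new section likely starts."""
    return [
        i for i in range(1, len(frames))
        if change_score(frames[i - 1], frames[i]) > THRESHOLD
    ]
```

A sharp jump in loudness, brightness, and vocal level between consecutive frames then registers as a candidate section boundary.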


Related Posts

No posts yet.