Adam Bandel


Transformer Training Full Pipeline

Oct 2025
Type: data-pipeline
Code: 25k lines
Files: 188
Active: Jun 2025 — Oct 2025
Stack:
Python, DuckDB, PyTorch, Apache Parquet, Hugging Face Transformers, MosaicML Streaming
Tags:
ai, data, machine-learning

Overview

A production-grade data curation system that transforms 20+ heterogeneous datasets into a training-optimized 40-billion-token corpus for pre-training a 1B-parameter language model. The pipeline uses a declarative, manifest-driven architecture in which a central DuckDB database serves as an immutable ledger, tracking every document through 22 distinct processing stages, from raw ingestion to tokenized MDS shards ready for distributed training.

The project represents a complete, hands-on learning system for modern LLM engineering, demonstrating techniques from data curation through alignment. The target model persona is a “Clinical Assistant”—direct, precise, and truthful with calibrated uncertainty, using a special [REFUSE] token to admit ignorance rather than hallucinate.

Screenshots

Pipeline Funnel Dashboard

Manifest Database Query

MDS Shard Inspector

Problem

Training a high-quality language model requires more than raw data volume—it demands a carefully curated corpus with consistent quality, appropriate diversity, and proper deduplication. Public datasets arrive in heterogeneous formats with varying quality levels, duplicate content, PII, benchmark contamination, and other issues that can degrade model performance. Traditional ad-hoc filtering scripts lack reproducibility, auditability, and the ability to iterate on curation decisions without re-processing entire datasets.

The challenge was building a system that could:

Approach

The solution is a declarative, manifest-driven pipeline where all curation logic is defined in YAML configuration files and executed as a directed acyclic graph of processing stages.
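
As a rough illustration of the idea, the sketch below runs a few stages in dependency order from a config dictionary. It is a minimal sketch and not the project's actual code: the `CONFIG` dictionary stands in for a parsed YAML file, and the `after` and `params` keys, the `min_chars` parameter, the stage functions, and the `filter_reason` strings are all hypothetical. Only the stage names are borrowed from the stage list further down.

```python
from graphlib import TopologicalSorter


def validate(doc, params):
    # Flag empty documents instead of dropping them.
    if not doc["text"].strip():
        doc["is_valid"] = False
        doc["filter_reason"] = "s00_validation:empty"


def normalize(doc, params):
    # Collapse runs of whitespace into single spaces.
    doc["text"] = " ".join(doc["text"].split())


def heuristic_clean(doc, params):
    # Flag documents shorter than the configured minimum length.
    if len(doc["text"]) < params["min_chars"]:
        doc["is_valid"] = False
        doc["filter_reason"] = "s04_heuristic_clean:too_short"


# Hypothetical registry mapping stage names to their implementations.
STAGE_FUNCS = {
    "s00_validation": validate,
    "s03_text_normalize": normalize,
    "s04_heuristic_clean": heuristic_clean,
}

# Stand-in for a parsed YAML file; stages are listed out of order on purpose.
CONFIG = {
    "s04_heuristic_clean": {"after": ["s03_text_normalize"], "params": {"min_chars": 10}},
    "s00_validation": {"after": [], "params": {}},
    "s03_text_normalize": {"after": ["s00_validation"], "params": {}},
}


def run_pipeline(config, docs):
    # Map each stage to the set of stages it depends on, then sort topologically.
    graph = {name: set(spec["after"]) for name, spec in config.items()}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        for doc in docs:
            if doc["is_valid"]:  # later stages skip documents already flagged
                STAGE_FUNCS[name](doc, config[name]["params"])
    return order


docs = [
    {"text": "  A   reasonably long   document. ", "is_valid": True, "filter_reason": None},
    {"text": "   ", "is_valid": True, "filter_reason": None},
    {"text": "tiny", "is_valid": True, "filter_reason": None},
]
order = run_pipeline(CONFIG, docs)
```

Because the execution order is derived from the declared dependencies rather than from the order of entries in the file, changing a threshold or inserting a stage only requires a config edit.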

Stack

Challenges

Outcomes

The pipeline successfully processes 20+ public datasets through all 22 stages, producing:

Key learnings: the immutable ledger pattern dramatically simplifies debugging and enables “what-if” analysis on filtering decisions. Separating metadata (DuckDB) from content (Parquet) provides the right tradeoff between query flexibility and storage efficiency.

Implementation Notes

The pipeline follows a strict ordering from cheapest to most expensive operations:

s00 Validation       → s01 Parquet Conversion → s02 HTML Cleaning
s03 Text Normalize   → s04 Heuristic Clean    → s04a PII Redaction
s05 Build Manifest   → s06 Assign Splits      → s07 Exact Dedup
s08 Language Filter  → s09 Quality Signals    → s09a Code Detection
s10 Metadata Filter  → s11 Decontamination    → s13 Heuristic Scoring
s14 Coarse Filter    → s15 Near Dedup (LSH)   → s16 PPL Tagging
s17 Toxicity Tag     → s18 Final Quality      → s19 Axis Tagging
s20 Control Tokens   → s21 Finalization (MDS)
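
One way to read this ordering: exact deduplication (s07) needs only a single hash per document, so it can run long before the costlier near-duplicate pass (s15). The sketch below shows what such a hash-based stage might look like. It is an assumption-laden illustration, not the pipeline's implementation: the choice of SHA-256, the in-memory list of dictionaries, the `doc_id` field, and the `filter_reason` format are all hypothetical.

```python
import hashlib


def exact_dedup(docs):
    """Flag byte-identical repeats; the first occurrence stays valid."""
    seen = {}  # content digest -> doc_id of the first document with that content
    flagged = 0
    for doc in docs:
        if not doc["is_valid"]:
            continue  # already flagged by an earlier, cheaper stage
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            # Mark the duplicate rather than deleting it, keeping the audit trail.
            doc["is_valid"] = False
            doc["filter_reason"] = f"s07_exact_dedup:duplicate_of:{seen[digest]}"
            flagged += 1
        else:
            seen[digest] = doc["doc_id"]
    return flagged


docs = [
    {"doc_id": "d1", "text": "hello world", "is_valid": True, "filter_reason": None},
    {"doc_id": "d2", "text": "hello world", "is_valid": True, "filter_reason": None},
    {"doc_id": "d3", "text": "Hello world", "is_valid": True, "filter_reason": None},
]
n_flagged = exact_dedup(docs)
```

Note that `d3` survives because it differs by one character; catching that kind of near match is the job of the later LSH stage.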

Document validity is tracked via is_valid and filter_reason columns. Each stage only marks documents as filtered—nothing is deleted—enabling full audit trails:

-- Count filtered documents per source dataset and filter reason
SELECT source_dataset, filter_reason, COUNT(*) AS n_filtered
FROM documents
WHERE NOT is_valid
GROUP BY source_dataset, filter_reason
ORDER BY n_filtered DESC;
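
The mark-don't-delete pattern behind that query can be sketched in a few lines of Python. To keep the example runnable without extra dependencies it uses the standard library's `sqlite3` as a stand-in for DuckDB. The `doc_id` and `n_chars` columns, the `mark_filtered` helper, the sample rows, and the `filter_reason` string are hypothetical; only `documents`, `source_dataset`, `is_valid`, and `filter_reason` come from the text above.

```python
import sqlite3

# In-memory stand-in for the manifest database (the real pipeline uses DuckDB).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE documents ("
    " doc_id TEXT PRIMARY KEY, source_dataset TEXT, n_chars INTEGER,"
    " is_valid INTEGER DEFAULT 1, filter_reason TEXT)"
)
con.executemany(
    "INSERT INTO documents (doc_id, source_dataset, n_chars) VALUES (?, ?, ?)",
    [("a1", "wiki", 1200), ("a2", "wiki", 15), ("b1", "forum", 8), ("b2", "forum", 900)],
)


def mark_filtered(con, where_sql, params, reason):
    """Flag matching rows that are still valid; rows are never deleted."""
    cur = con.execute(
        "UPDATE documents SET is_valid = 0, filter_reason = ? "
        f"WHERE is_valid AND ({where_sql})",
        (reason, *params),
    )
    return cur.rowcount


n_flagged = mark_filtered(con, "n_chars < ?", (50,), "s14_coarse_filter:too_short")

# Same shape as the audit query above.
summary = con.execute(
    "SELECT source_dataset, filter_reason, COUNT(*) FROM documents"
    " WHERE NOT is_valid GROUP BY source_dataset, filter_reason"
    " ORDER BY source_dataset"
).fetchall()

# "What-if": since every row is still present, a stricter threshold can be
# evaluated without re-processing anything.
would_flag = con.execute(
    "SELECT COUNT(*) FROM documents WHERE n_chars < ?", (1000,)
).fetchone()[0]
total_rows = con.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```

The `WHERE is_valid` guard in the update keeps the first stage that rejected a document as its recorded `filter_reason`, so later stages do not overwrite the audit trail.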

The refusal token strategy uses [REFUSE] and [RESPOND] prefixes to teach calibrated uncertainty:

<|im_start|>assistant
[RESPOND] The capital of France is Paris.<|im_end|>

<|im_start|>assistant
[REFUSE] I cannot provide real-time stock prices.<|im_end|>

These tokens are embedded in the tokenizer vocabulary from day one, enabling consistent training from SFT through DPO alignment where refusal is explicitly preferred over hallucination.
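
A small sketch of how such turns and preference pairs might be assembled is shown below. The assistant-turn layout mirrors the two examples above; everything else is an assumption made for illustration, including the `<|im_start|>user` turn format, the helper names, and the `prompt`/`chosen`/`rejected` record keys (a common convention for preference data, not something the project confirms).

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"
RESPOND, REFUSE = "[RESPOND]", "[REFUSE]"


def format_assistant_turn(text: str, refuse: bool = False) -> str:
    """Render one assistant turn with the matching control-token prefix."""
    control = REFUSE if refuse else RESPOND
    return f"{IM_START}assistant\n{control} {text}{IM_END}"


def make_preference_pair(prompt: str, refusal: str, hallucination: str) -> dict:
    """Hypothetical DPO record in which the refusal is the preferred completion."""
    return {
        "prompt": f"{IM_START}user\n{prompt}{IM_END}\n",  # assumed user-turn format
        "chosen": format_assistant_turn(refusal, refuse=True),
        "rejected": format_assistant_turn(hallucination),
    }


answer = format_assistant_turn("The capital of France is Paris.")
pair = make_preference_pair(
    "What is the stock price right now?",
    "I cannot provide real-time stock prices.",
    "The stock is trading at a made-up price.",
)
```

With Hugging Face tokenizers, control tokens like these are commonly registered through `add_special_tokens` so that each one maps to a single token ID; the post does not say which mechanism this project uses.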

