Inbox Processor

Jan 2026

Type: web-app

Code: 1k lines

Files: 23

Active: Nov 2025 — Jan 2026

Stack:

Tags:

Overview

Inbox Processor is a unified “second brain” entry point that aggregates content from multiple sources—Gmail and manual notes—into a single processing pipeline. It standardizes inputs, enforces context assignment, detects duplicates using both hash-based and semantic similarity matching, and intelligently routes items to their final destinations in external systems.

The system addresses a common knowledge management pain point: information arrives from multiple channels but lacks a unified system for organization, deduplication, and routing. Rather than manually sorting emails and notes into different apps, Inbox Processor provides a triage interface that ensures nothing falls through the cracks.

Screenshots

Inbox View

Duplicate Detection

Item Detail Modal

Problem

Information overload is real. Emails containing bookmarks, articles, media recommendations, and notes arrive constantly but end up scattered across inboxes without proper categorization. Manual triage is tedious, duplicates slip through, and items meant for specific systems (like a media feed or note-taking app) require manual copy-paste workflows.

The goal was to create a single intake point that:

Pulls content from email automatically
Allows quick manual note entry
Requires context assignment upfront (no more “I’ll organize this later”)
Catches duplicates before they clutter destination systems
Routes items to the right place with one click

Approach

The system uses a hybrid architecture combining traditional CRUD operations with vector embeddings for intelligent deduplication.

Stack

FastAPI (Python) - Async-first API framework with automatic OpenAPI documentation. Handles Gmail sync, CRUD operations, and external service integration.
React + TypeScript - Modern frontend with hooks-based state management. Vite for fast development builds, Tailwind CSS for styling.
SQLite + SQLAlchemy - Lightweight relational storage for inbox items with full ORM support. No separate database server needed.
ChromaDB - Vector database for semantic similarity search. Uses all-MiniLM-L6-v2 embeddings to detect near-duplicate content even when wording differs.
Docker Compose - Containerized deployment with multi-stage builds for both frontend (Nginx) and backend (Uvicorn).

Challenges

Semantic vs. Exact Deduplication - Hash-based deduplication catches identical content but misses paraphrased duplicates. Solved by combining SHA256 hashes with vector similarity (cosine distance < 0.5). Added secondary attachment checks to prevent false positives when file attachments differ.
Bulk Operations with External Services - Archiving items needs to sync state back to Gmail. Implemented async batch operations with semaphore-based throttling to avoid overwhelming the Gmail API while keeping the UI responsive.
Context Enforcement - Users tend to defer organization. Made context selection mandatory on both manual entry and import, with clear visual indicators. Items can’t be processed without a context.

Outcomes

The system successfully consolidates the email-to-second-brain workflow into a single interface. Key wins:

Faster triage - Bulk selection and one-click routing eliminates repetitive copy-paste
Fewer duplicates - Semantic matching catches items that hash-based dedup misses
Forced organization - Mandatory context prevents the “unsorted” pile from growing
Audit trail - All items tracked through their lifecycle (inbox → processed → archived)

Vector embeddings proved surprisingly effective for catching “soft duplicates” like forwarded emails or slightly reworded bookmarks. The hybrid approach (hash + vector) provides both speed and accuracy.

Implementation Notes

Deduplication Pipeline

New items pass through a two-stage deduplication check:

# Stage 1: Exact hash match
content_hash = hashlib.sha256(
    (content + "".join(sorted(attachment_names))).encode()
).hexdigest()

existing = db.query(InboxItem).filter(
    InboxItem.content_hash == content_hash
).first()

# Stage 2: Semantic similarity via ChromaDB
if not existing:
    results = collection.query(
        query_texts=[content],
        n_results=5,
        where={"context": context}
    )
    for distance in results["distances"][0]:
        if distance < 0.5:  # Similarity threshold
            # Flag as potential duplicate for manual review

Data Model

Items flow through defined states with clear transitions:

inbox → processed → archived
          ↓
       duplicate (linked to original)
          ↓
       pmf_failed (routing error captured)

The InboxItem model tracks source (gmail or manual), source_id for Gmail idempotency, mandatory context assignment, and attachments as JSON for file metadata.

Frontend Filtering

The sidebar provides multi-dimensional filtering:

Context: JAR, CMR, Personal, Uncategorized
Type: Web Links, Text Emails, Attachments, YouTube, IMDb
Status: Processed, Failed, Archived

Filters combine with useMemo for client-side performance. Dark mode persists to localStorage with system preference detection on first load.

No posts yet.

Adam Bandel