News Aggregator
Jan 2026
Overview
News Aggregator is a self-hosted platform designed to combat information overload by intelligently collecting, filtering, and enriching content from multiple sources. It aggregates posts from Reddit, Twitter/X, RSS feeds, and custom websites into a unified interface, then applies LLM-powered analysis to de-sensationalize headlines, extract key concepts, and score content relevance.
The system follows a microservices architecture with specialized scrapers, a central data pipeline, and an LLM enrichment layer. Users can define preference rules to automatically boost or penalize content based on keywords, sentiment, categories, and engagement metrics.
Screenshots
Problem
Modern news consumption involves checking multiple platforms (Reddit, Twitter, RSS readers) while being bombarded with sensationalized headlines, duplicate stories, and irrelevant content. There’s no unified way to:
- Aggregate content across different source types
- Filter out low-quality or repetitive posts
- Get objective summaries without clickbait
- Personalize feeds based on actual preferences rather than engagement-maximizing algorithms
Approach
The solution uses a distributed microservices architecture where each component has a single responsibility.
Stack
- Backend Framework - FastAPI for async REST APIs across all services, enabling high-throughput scraping without blocking
- Database - PostgreSQL 16 with JSONB columns for flexible metadata storage (LLM outputs, engagement metrics, source-specific fields)
- Task Scheduling - APScheduler for periodic scrape jobs with configurable intervals per source (a minimal scheduling sketch follows this list)
- Reddit Integration - asyncpraw for authenticated Reddit API access with rate limit handling
- Twitter Scraping - Nitter-based scraper to bypass API restrictions
- LLM Integration - OpenRouter API for model-agnostic summarization and analysis
- Frontend - React 18 with Material UI, featuring masonry layouts, infinite scroll, and drag-and-drop feed ordering
- Deployment - Docker Compose orchestrating 9 services with health checks and dependency ordering
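For the scheduling piece, here is a minimal sketch of how per-source scrape jobs could be registered with APScheduler; the interval, job id, and scrape_reddit coroutine are illustrative assumptions, not the project's actual code:
# Hypothetical sketch: per-source scrape jobs via APScheduler (names and intervals assumed)
import asyncio
from apscheduler.schedulers.asyncio import AsyncIOScheduler

async def scrape_reddit():
    ...  # fetch new posts with asyncpraw and POST them to the ingestor

async def main():
    scheduler = AsyncIOScheduler()
    # One job per source, each with its own configurable interval
    scheduler.add_job(scrape_reddit, "interval", minutes=15, id="reddit")
    scheduler.start()
    await asyncio.Event().wait()  # keep the loop alive for the scheduler

asyncio.run(main())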
Challenges
- Rate limiting across multiple APIs - Implemented per-source semaphores, TTL-cached in-flight request tracking, and exponential backoff to prevent thundering herd problems while maximizing throughput (a rough sketch follows this list)
- Memory leaks in long-running scrapers - SQLAlchemy identity maps accumulated objects over time; solved by explicitly calling expunge_all() before closing sessions across 21 locations
- Duplicate content detection - Database constraints on URLs plus UPSERT patterns for engagement metrics reduced database size by 92% (1.5 GB to 120 MB); see the UPSERT sketch after this list
- Balancing recommendation relevance - Reddit trending posts dominated feeds; implemented source fatigue penalties, type rotation boosts, and anti-consecutive algorithms to ensure diverse results
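The per-source rate limiting described above might look roughly like the following; the source names, limits, and fetch_with_backoff helper are assumptions for illustration, not the project's actual code:
# Hypothetical sketch: per-source semaphores plus exponential backoff (assumed names/limits)
import asyncio
import random

import httpx

SEMAPHORES = {"reddit": asyncio.Semaphore(5), "twitter": asyncio.Semaphore(2)}

async def fetch_with_backoff(client: httpx.AsyncClient, source: str, url: str, retries: int = 5):
    async with SEMAPHORES[source]:  # cap concurrent requests per source
        for attempt in range(retries):
            resp = await client.get(url)
            if resp.status_code != 429:
                return resp
            # Exponential backoff with jitter spreads retries and avoids a thundering herd
            await asyncio.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"{source}: still rate limited after {retries} attempts")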
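For the duplicate-detection point, the URL constraint plus UPSERT pattern can be expressed with SQLAlchemy's PostgreSQL dialect; the Post model and column names below are assumptions, not the real schema:
# Hypothetical sketch: URL-keyed UPSERT for engagement metrics (assumed model/columns)
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Post(Base):
    __tablename__ = "posts"
    id: Mapped[int] = mapped_column(primary_key=True)
    url: Mapped[str] = mapped_column(unique=True)  # the constraint that blocks duplicates
    score: Mapped[int] = mapped_column(default=0)
    num_comments: Mapped[int] = mapped_column(default=0)

def upsert_post(session, post: dict) -> None:
    stmt = insert(Post).values(**post)
    # Re-scraped URLs update engagement metrics instead of creating duplicate rows
    stmt = stmt.on_conflict_do_update(
        index_elements=["url"],
        set_={"score": stmt.excluded.score, "num_comments": stmt.excluded.num_comments},
    )
    session.execute(stmt)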
Outcomes
The platform successfully aggregates thousands of posts daily while maintaining sub-second query times. The preference scoring system allows fine-grained control over content ranking, and the LLM integration provides genuinely useful summaries that cut through sensationalism.
Key technical wins:
- Microservices architecture enables independent scaling of scrapers vs. query layer
- JSONB columns eliminated schema migrations for evolving LLM output formats
- Health check cascade ensures services start in correct dependency order
Implementation Notes
The recommendation engine normalizes scores across different source types since Reddit and Twitter have vastly different engagement scales:
# From shared/recommendation_utils.py (simplified excerpt)
import math

def calculate_recommended_score(item, source_type, index_in_type,
                                consecutive_count, recency_boost, preference_adjustment):
    # normalize_score (defined elsewhere in the module) puts Reddit and Twitter
    # engagement numbers on a shared scale
    base = normalize_score(item.trending_score, source_type)
    # Apply source fatigue - recently seen sources get penalized relative to fresher ones
    fatigue_boost = math.log(max(1, index_in_type)) * 3600
    # Anti-consecutive penalty prevents 5 Reddit posts in a row
    if consecutive_count > 2:
        base *= 0.7 ** (consecutive_count - 2)
    return base + fatigue_boost + recency_boost + preference_adjustment
Preference rules support complex boolean conditions stored as JSONB:
# Example rule: Boost AI content from Twitter
{
    "conditions": {
        "AND": [
            {"field": "category", "op": "contains", "value": "AI"},
            {"field": "source_type", "op": "==", "value": "Twitter"}
        ]
    },
    "adjustments": {"score": 5}
}
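A small recursive evaluator is enough to apply such rules; the sketch below is illustrative, and the operator set and apply_rule helper are assumptions rather than the service's actual code:
# Hypothetical sketch: evaluating a stored JSONB rule against a post dict
OPS = {
    "==": lambda a, b: a == b,
    "contains": lambda a, b: b in (a or ""),
}

def matches(conditions: dict, post: dict) -> bool:
    # AND/OR nodes recurse; leaf nodes compare a post field against a value
    if "AND" in conditions:
        return all(matches(c, post) for c in conditions["AND"])
    if "OR" in conditions:
        return any(matches(c, post) for c in conditions["OR"])
    return OPS[conditions["op"]](post.get(conditions["field"]), conditions["value"])

def apply_rule(rule: dict, post: dict, score: float) -> float:
    # Add the rule's score adjustment only when its condition tree matches
    if matches(rule["conditions"], post):
        score += rule["adjustments"].get("score", 0)
    return score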
The data flow follows a clear pipeline: Scheduler triggers Scraper -> Scraper POSTs to Ingestor -> Ingestor stores and queues LLM jobs -> LLM Manager enriches top items -> Fetcher serves filtered results to Frontend.
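As a concrete example of the first hop, a scraper might hand its results to the ingestor like this; the service URL, endpoint path, and payload shape are assumptions for illustration:
# Hypothetical sketch: scraper POSTing a batch to the ingestor (assumed endpoint/payload)
import httpx

async def push_to_ingestor(posts: list[dict]) -> None:
    async with httpx.AsyncClient(base_url="http://ingestor:8000") as client:
        # The ingestor stores new items and queues them for LLM enrichment
        resp = await client.post("/ingest", json={"source_type": "reddit", "items": posts})
        resp.raise_for_status()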