List Import Studio

Oct 2025

Type: desktop

Code: 48k lines

Files: 269

Active: Oct 2025 — Oct 2025

Stack:

Tags:

Overview

List Import Studio is a desktop application for importing, transforming, and enriching tabular data through visual pipelines. Users load CSV/XLSX files, map columns to reference schemas, build transformation graphs with filtering, branching, and derivation nodes, then execute workflows that handle deduplication, fuzzy record matching, and human-in-the-loop review gates.

The application combines a React/TypeScript frontend running in Tauri with a Python sidecar that handles heavy data processing via Polars. Communication happens through JSON-RPC over stdio, avoiding HTTP overhead while enabling rich bidirectional command dispatch.

Screenshots

Plan Canvas

Dataset Mapping

Transform Editor

Problem

Data imports from external sources are messy. Phone numbers come in inconsistent formats, emails need validation, company names require standardization, and duplicate records need detection. Traditional ETL tools are either too complex for one-off imports or too limited for sophisticated matching logic.

The goal was to build a tool that makes it easy to:

Visually design transformation pipelines without writing code
Apply fuzzy matching with configurable blocking rules and scoring ensembles
Pause execution at review gates for human verification of edge cases
Track data lineage to explain how any output row was derived

Approach

Stack

Tauri 2 - Desktop shell providing native file dialogs, process spawning, and cross-platform distribution without Electron’s overhead
React 19 + TypeScript - Frontend with feature-based architecture; hooks manage workflow state while Zod validates RPC contracts
Python Sidecar - Polars-based data engine spawned as a subprocess; handles all heavy computation
JSON-RPC over stdio - Custom protocol avoiding HTTP; parent process writes JSON requests to stdin, child responds via stdout
SQLite + Parquet - Workspace metadata in SQLite; cached pipeline stages in columnar Parquet for efficient resumption

Challenges

Row identity across transforms - After filtering, standardizing, and deduplicating, tracking which output rows correspond to which inputs requires persistent identifiers. Solution: each row gets a row_ref string that propagates through the pipeline, enabling the explain/review features to fetch original values.
Multi-dataset lineage - Match and Relate nodes combine two datasets, complicating schema propagation. Solution: routes track source_dataset_ids, and the frontend builds field selectors that label origins as “From Dataset: X” or “From Node: Transform”.
Pause/resume execution - Full runs on large datasets can take significant time; users need to pause at review nodes and resume later. Solution: cache checkpoints after each node with stable keys derived from input data + config hashes. Resume rehydrates from the last checkpoint without re-running earlier stages.
Quick-fix suggestions - When validation detects issues (e.g., invalid phone formats), the UI should suggest fixes. Solution: validator registry maps issue codes to transform suggestions; clicking “Apply fix” inserts the appropriate Transform node with pre-configured normalization.

Outcomes

The visual pipeline approach proves effective for complex data workflows. Key wins:

Iterative refinement - Preview at any node shows exactly what data looks like mid-pipeline; adjust transform config and immediately see results
Caching efficiency - Re-running after config changes only recomputes affected nodes; unchanged upstream stages load from Parquet cache
Review workflow - Match scoring produces histograms; users set thresholds and manually review borderline cases before export
Lineage transparency - Clicking any output row shows its derivation path through the pipeline with intermediate values

Implementation Notes

JSON-RPC Protocol

Frontend and sidecar communicate via JSON-RPC 2.0 over stdio:

// Frontend: src/services/sidecar/client.ts
export async function callRpc<T>(method: string, params: unknown): Promise<T> {
  const response = await Command.create('sidecar', ['--rpc'])
    .execute(JSON.stringify({ jsonrpc: '2.0', method, params, id: generateId() }));
  return JSON.parse(response.stdout).result;
}

# Sidecar: sidecar/rpc/dispatcher.py
def dispatch(request: dict) -> dict:
    handler = REGISTRY.get(request['method'])
    result = handler(**request.get('params', {}))
    return {'jsonrpc': '2.0', 'result': result, 'id': request['id']}

Plan Execution Engine

The plan engine traverses the node graph, caching results at each stage:

# sidecar/plan_engine.py
def execute_node(node: PlanNode, context: PlanContext) -> DataFrame:
    cache_key = compute_stable_key(node, context.upstream_data)

    if cached := context.cache.get(cache_key):
        return cached

    match node.type:
        case 'transform':
            result = apply_standardizers(context.upstream_data, node.config.operations)
        case 'filter':
            result = apply_filter_conditions(context.upstream_data, node.config.conditions)
        case 'branch':
            # Returns dict of outcome -> DataFrame
            result = route_by_conditions(context.upstream_data, node.config.branches)
        case 'match':
            result = score_candidates(context.upstream_data, node.config.ensemble)

    context.cache.set(cache_key, result)
    return result

Blocking Rules for Matching

Fuzzy matching at scale requires blocking to avoid O(n*m) comparisons:

# sidecar/match/blocking.py
def build_blocking_index(records: DataFrame, rules: list[BlockingRule]) -> dict:
    """Group records by blocking keys for efficient candidate generation."""
    index = defaultdict(list)
    for row in records.iter_rows(named=True):
        for rule in rules:
            key = normalize_blocking_key(row, rule)
            index[key].append(row['_row_ref'])
    return index

State Persistence

Workspace state auto-saves to SQLite on navigation, enabling seamless resume:

// Frontend: src/features/workspace/hooks/useProjectPersistence.ts
const saveState = useCallback(async () => {
  const snapshot = {
    plan: planState,
    datasets: datasetState,
    mapping: mappingState,
    settings: settingsState,
  };
  await rpc.project.saveState(projectId, snapshot);
}, [projectId, planState, datasetState, mappingState, settingsState]);

Oct 15, 2025 Devlog - List Import Studio

Adam Bandel