webproxy/docs/ARCHITECTURE.md

# WebProxy Architecture

## Overview

WebProxy is a self-hosted web indexing proxy that crawls, caches, and serves internet content to devices on your local network.

## Recommended Tech Stack

The following open source stack is recommended for a TypeScript/Node.js monorepo:

### Tier 1: Core (Native TypeScript)

| Component | Tool | npm Package | Role |
|-----------|------|-------------|------|
| Crawler | Crawlee | `crawlee`, `@crawlee/playwright` | Crawl/index topics of interest |
| Forward Proxy | http-proxy-3 | `http-proxy-3` | Transparent caching proxy for network devices |
| MITM Proxy | mockttp | `mockttp` | HTTPS interception, request/response capture |
| WARC Storage | warcio.js | `warcio` | Write/read WARC archives of cached content |
| Search | MeiliSearch | `meilisearch` (client) | Full-text search of indexed content |
| Dashboard | Next.js | `next` | Web dashboard for management |

### Tier 2: Utility Libraries

| Component | Package | Role |
|-----------|---------|------|
| Content extraction | `@mozilla/readability` | Extract article content from HTML |
| HTML to Markdown | `turndown` | Convert HTML to Markdown for AI/LLM use |
| DOM parsing | `cheerio` | Server-side HTML parsing |
| Caching layer | `keyv` or custom LRU | HTTP response caching with TTL |
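
For the "custom LRU" option in the caching row, a minimal sketch of an in-memory LRU cache with a per-entry TTL might look like the following. The class name `TtlLruCache` and its methods are hypothetical and are not part of `keyv`; the injectable clock exists only to make expiry easy to test.

```typescript
// Hypothetical sketch of an LRU cache with per-entry TTL (not keyv's API).
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
}

class TtlLruCache<V> {
  // A Map iterates in insertion order, so the first key is the least recently used.
  private entries = new Map<string, CacheEntry<V>>();
  private maxEntries: number;
  private now: () => number;

  constructor(maxEntries: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.now = now; // injectable clock, an assumption made for testability
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert so the key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V, ttlMs: number): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A proxy would likely key entries by request URL (plus any headers that vary the response) and derive `ttlMs` from `Cache-Control`; both choices are outside the scope of this sketch.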

### Tier 3: Optional Sidecars (Docker)

| Component | Tool | Role |
|-----------|------|------|
| Replay engine | pywb | Serve archived WARC content Wayback-style |
| Deep archiving | ArchiveBox | Comprehensive page archiving |

## Why These Choices

### Crawlee over Scrapy/Nutch/Crawl4AI
- **Native TypeScript** - single language across the monorepo
- Supports Puppeteer, Playwright, Cheerio, and raw HTTP crawlers
- Built-in request queue, proxy rotation, autoscaling
- Actively maintained by Apify (monthly releases)

### http-proxy-3 + mockttp over Squid/mitmproxy
- Pure Node.js - no external binary management
- `http-proxy-3` is a modern rewrite of `http-proxy` that fixes socket leaks and adds partial HTTP/2 support
- `mockttp` provides full MITM capabilities natively in TypeScript

### MeiliSearch over OpenSearch/YaCy/Solr
- Rust binary - lightweight enough to run on a Raspberry Pi with about 1 GB of RAM
- Official Node.js SDK with TypeScript types
- Automatic language detection and typo tolerance
- Single binary deployment
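
As a rough illustration of what the indexer might send to the search engine, the sketch below builds a document from a fetched page. Meilisearch document ids may contain only alphanumeric characters, hyphens, and underscores, so a raw URL cannot be used as the primary key; hashing the URL is one way around that. The `PageDocument` shape, its field names, and both helper functions are assumptions, not something this architecture prescribes.

```typescript
import { createHash } from "node:crypto";

// Hypothetical document shape for the search index; field names are assumptions.
interface PageDocument {
  id: string; // primary key: hex digest of the URL
  url: string;
  title: string;
  content: string;
  fetchedAt: string; // ISO 8601 timestamp
}

// Hash the URL to a 64-character hex string, which satisfies the id character rules.
function documentIdForUrl(url: string): string {
  return createHash("sha256").update(url).digest("hex");
}

// Build an index-ready document from already-extracted page fields.
function toPageDocument(
  url: string,
  title: string,
  content: string,
  fetchedAt: Date,
): PageDocument {
  return {
    id: documentIdForUrl(url),
    url,
    title: title.trim(),
    content,
    fetchedAt: fetchedAt.toISOString(),
  };
}
```

With the official `meilisearch` client, documents like this would typically be submitted with `client.index("pages").addDocuments([doc])`, where the index name `pages` is a placeholder.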

### warcio.js over warcprox
- Native TypeScript (v2.0+) by the Webrecorder team
- Streaming WARC read/write for both browser and Node.js
- No Python dependency

## Data Flow

```
Network Device                     Internet
     |                                |
     | HTTP(S) request                |
     v                                |
  [proxy]                             |
     |                                |
     |--> Cache hit? --> [storage] --> Serve from WARC
     |                                |
     |--> Cache miss? --------------> Fetch from Internet
     |         |                      |
     |         +--> [storage] ------> Write to WARC
     |         +--> [indexer] ------> Index in MeiliSearch
     |                                |
     v                                |
  Response to Device                  |
                                      |
  [crawler] (scheduled) ------------> Crawl topics
     |                                |
     +--> [storage] ----------------> Write to WARC
     +--> [indexer] ----------------> Index in MeiliSearch
```
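
The proxy branch of the diagram above can be sketched as a single request handler. The interfaces `WarcStore` and `SearchIndexer`, the `Fetcher` type, and the function `handleRequest` are hypothetical stand-ins for the `[storage]`, `[indexer]`, and upstream fetch steps. The diagram does not say whether storage and indexing finish before the response is returned; this sketch waits for both, which is an assumption.

```typescript
// Hypothetical types standing in for the boxes in the data flow diagram.
interface CachedResponse {
  status: number;
  body: string;
}

interface WarcStore {
  read(url: string): Promise<CachedResponse | undefined>;
  write(url: string, response: CachedResponse): Promise<void>;
}

interface SearchIndexer {
  index(url: string, response: CachedResponse): Promise<void>;
}

type Fetcher = (url: string) => Promise<CachedResponse>;

async function handleRequest(
  url: string,
  storage: WarcStore,
  indexer: SearchIndexer,
  fetchFromInternet: Fetcher,
): Promise<{ response: CachedResponse; cacheHit: boolean }> {
  // Cache hit: serve the archived copy without touching the network.
  const cached = await storage.read(url);
  if (cached !== undefined) {
    return { response: cached, cacheHit: true };
  }

  // Cache miss: fetch upstream, then archive and index the result.
  const response = await fetchFromInternet(url);
  await Promise.all([
    storage.write(url, response),
    indexer.index(url, response),
  ]);
  return { response, cacheHit: false };
}
```

Passing the storage, indexer, and fetcher in as parameters keeps the handler independent of any particular WARC or search library, so each piece can be replaced with an in-memory fake in tests.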

## Package Structure

```
packages/
  core/          # http-proxy-3 + mockttp based forward proxy
  indexer/       # Crawlee-based topic crawler + MeiliSearch indexing
  shared/        # Shared types, utilities, config schemas

apps/
  web/           # Next.js landing page & admin dashboard
  proxy/         # Main proxy server entry point
```