# WebProxy Architecture

## Overview

WebProxy is a self-hosted web indexing proxy that crawls, caches, and serves internet content to devices on your local network.

## Recommended Tech Stack

Based on research into current open source tools, the following stack is recommended for a TypeScript/Node.js monorepo:

### Tier 1: Core (Native TypeScript)

| Component | Tool | npm Package | Role |
|-----------|------|-------------|------|
| Crawler | Crawlee | `crawlee`, `@crawlee/playwright` | Crawl/index topics of interest |
| Forward Proxy | http-proxy-3 | `http-proxy-3` | Transparent caching proxy for network devices |
| MITM Proxy | mockttp | `mockttp` | HTTPS interception, request/response capture |
| WARC Storage | warcio.js | `warcio` | Write/read WARC archives of cached content |
| Search | MeiliSearch | `meilisearch` (client) | Full-text search of indexed content |
| Dashboard | Next.js | `next` | Web dashboard for management |

### Tier 2: Utility Libraries

| Component | Package | Role |
|-----------|---------|------|
| Content extraction | `@mozilla/readability` | Extract article content from HTML |
| HTML to Markdown | `turndown` | Convert HTML to Markdown for AI/LLM use |
| DOM parsing | `cheerio` | Server-side HTML parsing |
| Caching layer | `keyv` or custom LRU | HTTP response caching with TTL |

### Tier 3: Optional Sidecars (Docker)

| Component | Tool | Role |
|-----------|------|------|
| Replay engine | pywb | Serve archived WARC content Wayback-style |
| Deep archiving | ArchiveBox | Comprehensive page archiving |

## Why These Choices

### Crawlee over Scrapy/Nutch/Crawl4AI

- **Native TypeScript** - single language across the monorepo
- Supports Puppeteer, Playwright, Cheerio, and raw HTTP crawlers
- Built-in request queue, proxy rotation, autoscaling
- Actively maintained by Apify (monthly releases)

### http-proxy-3 + mockttp over Squid/mitmproxy

- Pure Node.js - no external binary management
- `http-proxy-3` is a modern rewrite that fixes socket leaks and has partial HTTP/2 support
- `mockttp` provides full MITM capabilities natively in TypeScript

### MeiliSearch over OpenSearch/YaCy/Solr

- Rust binary - lightweight, runs on Raspberry Pi (~1GB RAM)
- Official Node.js SDK with TypeScript types
- Auto language detection, typo tolerance
- Single binary deployment

### warcio.js over warcprox

- Native TypeScript (v2.0+) by the Webrecorder team
- Streaming WARC read/write for both browser and Node.js
- No Python dependency

## Data Flow

```
Network Device                          Internet
     |                                      |
     |  HTTP(S) request                     |
     v                                      |
  [proxy]                                   |
     |                                      |
     |--> Cache hit? --> [storage] --> Serve from WARC
     |                                      |
     |--> Cache miss? --------------> Fetch from Internet
     |         |                            |
     |         +--> [storage] ------> Write to WARC
     |         +--> [indexer] ------> Index in MeiliSearch
     |                                      |
     v                                      |
Response to Device                          |
                                            |
  [crawler] (scheduled) ------------> Crawl topics
     |                                      |
     +--> [storage] ----------------> Write to WARC
     +--> [indexer] ----------------> Index in MeiliSearch
```

## Package Structure

```
packages/
  core/       # http-proxy-3 + mockttp based forward proxy
  indexer/    # Crawlee-based topic crawler + MeiliSearch indexing
  shared/     # Shared types, utilities, config schemas
apps/
  web/        # Next.js landing page & admin dashboard
  proxy/      # Main proxy server entry point
```
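
To make the "Cache hit?" branch of the data flow and the "custom LRU" option from the Tier 2 table more concrete, here is a minimal in-memory sketch of an LRU cache with a per-entry TTL. It is an illustration under assumptions, not the project's implementation: `LruTtlCache`, `CachedResponse`, and the injected `now` clock are hypothetical names, the cache is keyed on the URL alone, and bodies are held in memory rather than read from WARC storage.

```typescript
// Hypothetical sketch: none of these names come from the WebProxy codebase.

interface CachedResponse {
  status: number;
  body: string;
}

interface Entry {
  value: CachedResponse;
  expiresAt: number; // epoch milliseconds
}

class LruTtlCache {
  // A Map iterates in insertion order, so its first key is the least recently used.
  private readonly entries = new Map<string, Entry>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, so expiry can be tested without waiting
  }

  get(url: string): CachedResponse | undefined {
    const entry = this.entries.get(url);
    if (entry === undefined) return undefined; // cache miss
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(url); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(url);
    this.entries.set(url, entry);
    return entry.value;
  }

  set(url: string, value: CachedResponse): void {
    this.entries.delete(url);
    this.entries.set(url, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Over capacity: evict the least recently used key.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Usage: keep up to 100 responses, each considered fresh for 60 seconds.
const demoCache = new LruTtlCache(100, 60_000);
demoCache.set("http://example.com/", { status: 200, body: "<html>...</html>" });
const demoHit = demoCache.get("http://example.com/"); // a hit while the entry is fresh
```

In the proxy, a `get` that returns `undefined` would correspond to the "Cache miss?" branch: fetch from the origin, then `set` the response. A real cache would likely also need to account for the request method and `Vary` headers, which this sketch ignores.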