webproxy/docs/ARCHITECTURE.md
Jeremy Meyer 17bba2d040 feat: initial monorepo setup with Next.js landing page
- pnpm workspaces monorepo with apps/ and packages/
- Next.js 16 landing page (apps/web) with dark theme, feature overview
- Package stubs: @webproxy/core, @webproxy/indexer, @webproxy/shared
- Proxy server placeholder (apps/proxy)
- Project spec, architecture docs, and deployment guide
- Gitea remote configured at 185.191.239.154:3000

Co-Authored-By: UnicornDev <noreply@unicorndev.wtf>
2026-02-26 18:24:28 -08:00

WebProxy Architecture

Overview

WebProxy is a self-hosted web indexing proxy that crawls, caches, and serves internet content to devices on your local network.

The recommended stack for a TypeScript/Node.js monorepo, chosen after a survey of current open source tools, is grouped into three tiers:

Tier 1: Core (Native TypeScript)

| Component     | Tool         | npm Package                  | Role                                          |
|---------------|--------------|------------------------------|-----------------------------------------------|
| Crawler       | Crawlee      | crawlee, @crawlee/playwright | Crawl/index topics of interest                |
| Forward Proxy | http-proxy-3 | http-proxy-3                 | Transparent caching proxy for network devices |
| MITM Proxy    | mockttp      | mockttp                      | HTTPS interception, request/response capture  |
| WARC Storage  | warcio.js    | warcio                       | Write/read WARC archives of cached content    |
| Search        | MeiliSearch  | meilisearch (client)         | Full-text search of indexed content           |
| Dashboard     | Next.js      | next                         | Web dashboard for management                  |

Tier 2: Utility Libraries

| Component          | Package              | Role                                    |
|--------------------|----------------------|-----------------------------------------|
| Content extraction | @mozilla/readability | Extract article content from HTML       |
| HTML to Markdown   | turndown             | Convert HTML to Markdown for AI/LLM use |
| DOM parsing        | cheerio              | Server-side HTML parsing                |
| Caching layer      | keyv or custom LRU   | HTTP response caching with TTL          |
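
The "custom LRU" option for the caching layer could look roughly like the sketch below. It is a minimal illustration of LRU eviction combined with a per-entry TTL, not code from the repository: the class name `TtlLruCache`, its constructor parameters, and the injectable clock are all assumptions made for this example.

```typescript
// Hypothetical sketch of a "custom LRU" cache with TTL. None of these names
// exist in the WebProxy packages; they are illustrative only.
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds after which the entry is stale
}

class TtlLruCache<V> {
  // A Map iterates in insertion order, so the first key is the least recently used.
  private entries = new Map<string, CacheEntry<V>>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, which makes expiry testable
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      // Expired entries are dropped lazily, on access.
      this.entries.delete(key);
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry (the first key in the Map).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

For a proxy, the key would typically be derived from the request method and URL, and the TTL from the response's cache headers. This sketch uses one fixed TTL for every entry to stay short.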

Tier 3: Optional Sidecars (Docker)

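| Component      | Tool       | Role                                       |
|----------------|------------|--------------------------------------------|
| Replay engine  | pywb       | Serve archived WARC content Wayback-style  |
| Deep archiving | ArchiveBox | Comprehensive page archiving               |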

Why These Choices

Crawlee over Scrapy/Nutch/Crawl4AI

  • Native TypeScript - single language across the monorepo
  • Supports Puppeteer, Playwright, Cheerio, and raw HTTP crawlers
  • Built-in request queue, proxy rotation, autoscaling
  • Actively maintained by Apify (monthly releases)

http-proxy-3 + mockttp over Squid/mitmproxy

  • Pure Node.js - no external binary management
  • http-proxy-3 is a modern rewrite that fixes socket leaks and adds partial HTTP/2 support
  • mockttp provides full MITM capabilities natively in TypeScript

MeiliSearch over OpenSearch/YaCy/Solr

  • Rust binary - lightweight, runs on a Raspberry Pi (~1 GB RAM)
  • Official Node.js SDK with TypeScript types
  • Automatic language detection and typo tolerance
  • Single binary deployment

warcio.js over warcprox

  • Native TypeScript (v2.0+) by the Webrecorder team
  • Streaming WARC read/write for both browser and Node.js
  • No Python dependency

Data Flow

Network Device                     Internet
     |                                |
     | HTTP(S) request                |
     v                                |
  [proxy]                             |
     |                                |
     |--> Cache hit? --> [storage] --> Serve from WARC
     |                                |
     |--> Cache miss? --------------> Fetch from Internet
     |         |                      |
     |         +--> [storage] ------> Write to WARC
     |         +--> [indexer] ------> Index in MeiliSearch
     |                                |
     v                                |
  Response to Device                  |
                                      |
  [crawler] (scheduled) ------------> Crawl topics
     |                                |
     +--> [storage] ----------------> Write to WARC
     +--> [indexer] ----------------> Index in MeiliSearch
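
The proxy branch of the diagram above (cache hit versus cache miss) can be sketched in TypeScript as follows. This is only an illustration under stated assumptions: `ArchivedResponse`, `ResponseStore`, `Indexer`, `Fetcher`, `handleRequest`, and the in-memory classes are hypothetical names, and the in-memory store and indexer stand in for WARC storage and MeiliSearch so that the example has no dependencies.

```typescript
// Hypothetical sketch of the cache-hit / cache-miss flow. The in-memory
// classes are stand-ins for WARC storage and the MeiliSearch index.
interface ArchivedResponse {
  url: string;
  status: number;
  body: string;
}

interface ResponseStore {
  read(url: string): Promise<ArchivedResponse | undefined>;
  write(response: ArchivedResponse): Promise<void>;
}

interface Indexer {
  index(response: ArchivedResponse): Promise<void>;
}

type Fetcher = (url: string) => Promise<ArchivedResponse>;

interface ProxyDeps {
  storage: ResponseStore;
  indexer: Indexer;
  fetchOrigin: Fetcher;
}

class MemoryStore implements ResponseStore {
  private records = new Map<string, ArchivedResponse>();
  async read(url: string): Promise<ArchivedResponse | undefined> {
    return this.records.get(url);
  }
  async write(response: ArchivedResponse): Promise<void> {
    this.records.set(response.url, response);
  }
}

class MemoryIndexer implements Indexer {
  indexed: string[] = [];
  async index(response: ArchivedResponse): Promise<void> {
    this.indexed.push(response.url);
  }
}

async function handleRequest(
  url: string,
  deps: ProxyDeps,
): Promise<{ response: ArchivedResponse; source: "cache" | "origin" }> {
  // Cache hit: serve straight from storage.
  const cached = await deps.storage.read(url);
  if (cached !== undefined) return { response: cached, source: "cache" };

  // Cache miss: fetch from the Internet, then archive and index the result.
  const fresh = await deps.fetchOrigin(url);
  await deps.storage.write(fresh);
  await deps.indexer.index(fresh);
  return { response: fresh, source: "origin" };
}
```

The sketch waits for the archive write and the indexing step before responding, which keeps the logic easy to follow. A real proxy might respond to the device first and archive in the background. The scheduled crawler path would reuse the same `write` and `index` steps without the cache lookup.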

Package Structure

packages/
  core/          # http-proxy-3 + mockttp based forward proxy
  indexer/       # Crawlee-based topic crawler + MeiliSearch indexing
  shared/        # Shared types, utilities, config schemas

apps/
  web/           # Next.js landing page & admin dashboard
  proxy/         # Main proxy server entry point
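
As an illustration of what the "shared types, utilities, config schemas" in shared/ might contain, here is a minimal sketch of a typed configuration object with a runtime validator. Every field name (`proxyPort`, `searchUrl`, `warcDir`, `crawlTopics`) and the `parseConfig` function are assumptions invented for this example. The actual schema is not described in this document.

```typescript
// Hypothetical config schema for packages/shared. Field names are
// illustrative assumptions, not the project's real configuration.
interface WebProxyConfig {
  proxyPort: number;     // port the forward proxy listens on
  searchUrl: string;     // base URL of the search server
  warcDir: string;       // directory where WARC archives are written
  crawlTopics: string[]; // topics the scheduled crawler should index
}

function parseConfig(input: unknown): WebProxyConfig {
  if (typeof input !== "object" || input === null) {
    throw new Error("config must be an object");
  }
  const { proxyPort, searchUrl, warcDir, crawlTopics } = input as Record<string, unknown>;

  if (
    typeof proxyPort !== "number" ||
    !Number.isInteger(proxyPort) ||
    proxyPort < 1 ||
    proxyPort > 65535
  ) {
    throw new Error("proxyPort must be an integer between 1 and 65535");
  }
  if (typeof searchUrl !== "string" || searchUrl.length === 0) {
    throw new Error("searchUrl must be a non-empty string");
  }
  if (typeof warcDir !== "string" || warcDir.length === 0) {
    throw new Error("warcDir must be a non-empty string");
  }
  if (!Array.isArray(crawlTopics) || !crawlTopics.every((t) => typeof t === "string")) {
    throw new Error("crawlTopics must be an array of strings");
  }
  return { proxyPort, searchUrl, warcDir, crawlTopics: crawlTopics as string[] };
}
```

Keeping the type and its validator in one shared package would let apps/proxy and apps/web read the same configuration and agree on its shape.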