webproxy/docs/ARCHITECTURE.md
Jeremy Meyer 17bba2d040 feat: initial monorepo setup with Next.js landing page
- pnpm workspaces monorepo with apps/ and packages/
- Next.js 16 landing page (apps/web) with dark theme, feature overview
- Package stubs: @webproxy/core, @webproxy/indexer, @webproxy/shared
- Proxy server placeholder (apps/proxy)
- Project spec, architecture docs, and deployment guide
- Gitea remote configured at 185.191.239.154:3000

Co-Authored-By: UnicornDev <noreply@unicorndev.wtf>
2026-02-26 18:24:28 -08:00


# WebProxy Architecture
## Overview
WebProxy is a self-hosted web indexing proxy that crawls, caches, and serves internet content to devices on your local network.
## Recommended Tech Stack
Based on a survey of current open-source tools, the following stack is recommended for a TypeScript/Node.js monorepo:
### Tier 1: Core (Native TypeScript)
| Component | Tool | npm Package | Role |
|-----------|------|-------------|------|
| Crawler | Crawlee | `crawlee`, `@crawlee/playwright` | Crawl/index topics of interest |
| Forward Proxy | http-proxy-3 | `http-proxy-3` | Transparent caching proxy for network devices |
| MITM Proxy | mockttp | `mockttp` | HTTPS interception, request/response capture |
| WARC Storage | warcio.js | `warcio` | Write/read WARC archives of cached content |
| Search | MeiliSearch | `meilisearch` (client) | Full-text search of indexed content |
| Dashboard | Next.js | `next` | Web dashboard for management |
### Tier 2: Utility Libraries
| Component | Package | Role |
|-----------|---------|------|
| Content extraction | `@mozilla/readability` | Extract article content from HTML |
| HTML to Markdown | `turndown` | Convert HTML to Markdown for AI/LLM use |
| DOM parsing | `cheerio` | Server-side HTML parsing |
| Caching layer | `keyv` or custom LRU | HTTP response caching with TTL |
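The "custom LRU" option in the caching row can be quite small. The following is a minimal sketch, assuming an in-memory cache keyed by URL with one TTL for all entries; `LruTtlCache` and its parameters are hypothetical names, not part of the project or of `keyv`.

```typescript
// Hypothetical in-memory LRU cache with a fixed TTL per entry.
// Names are illustrative only and do not come from the project.
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
}

class LruTtlCache<V> {
  // Map preserves insertion order, so the first key is the least recently used.
  private entries = new Map<string, CacheEntry<V>>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, which makes expiry easy to test
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used key (first in insertion order).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A real HTTP cache would likely derive the TTL per response from `Cache-Control` headers rather than use one global value; the fixed TTL here only keeps the sketch short.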
### Tier 3: Optional Sidecars (Docker)
| Component | Tool | Role |
|-----------|------|------|
| Replay engine | pywb | Serve archived WARC content Wayback-style |
| Deep archiving | ArchiveBox | Comprehensive page archiving |
## Why These Choices
### Crawlee over Scrapy/Nutch/Crawl4AI
- **Native TypeScript** - single language across the monorepo
- Supports Puppeteer, Playwright, Cheerio, and raw HTTP crawlers
- Built-in request queue, proxy rotation, autoscaling
- Actively maintained by Apify (monthly releases)
### http-proxy-3 + mockttp over Squid/mitmproxy
- Pure Node.js - no external binary management
- `http-proxy-3` is a modern rewrite of `http-proxy` that fixes socket leaks and adds partial HTTP/2 support
- `mockttp` provides full MITM capabilities natively in TypeScript
### MeiliSearch over OpenSearch/YaCy/Solr
- Lightweight Rust binary that runs on a Raspberry Pi (~1 GB RAM)
- Official Node.js SDK with TypeScript types
- Auto language detection, typo tolerance
- Single binary deployment
### warcio.js over warcprox
- Native TypeScript (v2.0+) by the Webrecorder team
- Streaming WARC read/write for both browser and Node.js
- No Python dependency
## Data Flow
```
Network Device                      Internet
     |                                  |
     |  HTTP(S) request                 |
     v                                  |
  [proxy]                               |
     |                                  |
     |--> Cache hit? --> [storage] --> Serve from WARC
     |                                  |
     |--> Cache miss? --------------> Fetch from Internet
     |                                  |        |
     |                                  +--> [storage] ------> Write to WARC
     |                                  +--> [indexer] ------> Index in MeiliSearch
     |                                  |
     v                                  |
Response to Device                      |
                                        |
[crawler] (scheduled) ------------> Crawl topics
     |                                  |
     +--> [storage] ----------------> Write to WARC
     +--> [indexer] ----------------> Index in MeiliSearch
```
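The proxy branch of this flow can be sketched as a single request handler. This is a rough illustration rather than the project's implementation: `ResponseStorage`, `ContentIndexer`, `handleRequest`, and the in-memory stand-ins are all hypothetical names, and the real storage and indexer would be backed by WARC files and MeiliSearch.

```typescript
// All names below are hypothetical stand-ins, not the project's actual API.
interface CachedResponse {
  url: string;
  status: number;
  body: string;
}

interface ResponseStorage {
  read(url: string): Promise<CachedResponse | undefined>;
  write(response: CachedResponse): Promise<void>;
}

interface ContentIndexer {
  index(response: CachedResponse): Promise<void>;
}

type UpstreamFetch = (url: string) => Promise<CachedResponse>;

interface ProxyResult {
  response: CachedResponse;
  source: "cache" | "origin";
}

async function handleRequest(
  url: string,
  storage: ResponseStorage,
  indexer: ContentIndexer,
  fetchUpstream: UpstreamFetch,
): Promise<ProxyResult> {
  // Cache hit: serve the stored copy without touching the network.
  const cached = await storage.read(url);
  if (cached !== undefined) {
    return { response: cached, source: "cache" };
  }
  // Cache miss: fetch from the origin, then archive and index the result.
  const fresh = await fetchUpstream(url);
  await Promise.all([storage.write(fresh), indexer.index(fresh)]);
  return { response: fresh, source: "origin" };
}

// In-memory stand-ins so the sketch runs without WARC files or MeiliSearch.
class MemoryStorage implements ResponseStorage {
  private byUrl = new Map<string, CachedResponse>();
  async read(url: string): Promise<CachedResponse | undefined> {
    return this.byUrl.get(url);
  }
  async write(response: CachedResponse): Promise<void> {
    this.byUrl.set(response.url, response);
  }
}

class MemoryIndexer implements ContentIndexer {
  indexedUrls: string[] = [];
  async index(response: CachedResponse): Promise<void> {
    this.indexedUrls.push(response.url);
  }
}

// Requests the same URL twice and reports where each response came from.
async function demo(): Promise<string[]> {
  const storage = new MemoryStorage();
  const indexer = new MemoryIndexer();
  const fetchUpstream: UpstreamFetch = async (url) => ({
    url,
    status: 200,
    body: "<html>hello</html>",
  });
  const first = await handleRequest("http://example.com/", storage, indexer, fetchUpstream);
  const second = await handleRequest("http://example.com/", storage, indexer, fetchUpstream);
  return [first.source, second.source];
}
```

The sketch waits for the archive and index writes before returning. A production proxy might instead send the response to the device first and run both writes in the background, trading durability for latency.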
## Package Structure
```
packages/
  core/       # http-proxy-3 + mockttp based forward proxy
  indexer/    # Crawlee-based topic crawler + MeiliSearch indexing
  shared/     # Shared types, utilities, config schemas
apps/
  web/        # Next.js landing page & admin dashboard
  proxy/      # Main proxy server entry point
```
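As an illustration of what the "config schemas" in `shared/` might look like, here is a minimal sketch of a typed config with defaults and validation. Every field name and default value is an assumption made for the example (only MeiliSearch's default port 7700 is a known fact); the project may define its schema quite differently.

```typescript
// Hypothetical config schema for packages/shared. Every field name here is an
// assumption for illustration and is not taken from the project.
interface WebProxyConfig {
  proxyPort: number; // port the forward proxy listens on
  meiliHost: string; // base URL of the MeiliSearch instance
  warcDir: string; // directory where WARC files are written
  crawlTopics: string[]; // topics handed to the scheduled crawler
}

const defaultConfig: WebProxyConfig = {
  proxyPort: 8080,
  meiliHost: "http://localhost:7700",
  warcDir: "./data/warc",
  crawlTopics: [],
};

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Merge untrusted input over the defaults, rejecting values of the wrong type.
function parseConfig(input: unknown): WebProxyConfig {
  if (!isRecord(input)) throw new Error("config must be an object");
  const config: WebProxyConfig = { ...defaultConfig, crawlTopics: [] };

  const port = input["proxyPort"];
  if (port !== undefined) {
    if (typeof port !== "number" || !Number.isInteger(port) || port < 1 || port > 65535) {
      throw new Error("proxyPort must be an integer between 1 and 65535");
    }
    config.proxyPort = port;
  }

  const host = input["meiliHost"];
  if (host !== undefined) {
    if (typeof host !== "string" || host.length === 0) {
      throw new Error("meiliHost must be a non-empty string");
    }
    config.meiliHost = host;
  }

  const dir = input["warcDir"];
  if (dir !== undefined) {
    if (typeof dir !== "string" || dir.length === 0) {
      throw new Error("warcDir must be a non-empty string");
    }
    config.warcDir = dir;
  }

  const topics = input["crawlTopics"];
  if (topics !== undefined) {
    if (!Array.isArray(topics)) throw new Error("crawlTopics must be an array of strings");
    const parsed: string[] = [];
    for (const topic of topics) {
      if (typeof topic !== "string") throw new Error("crawlTopics must be an array of strings");
      parsed.push(topic);
    }
    config.crawlTopics = parsed;
  }

  return config;
}
```

Keeping the schema in `shared/` would let `apps/proxy`, `core`, and `indexer` validate the same config object instead of each re-parsing environment variables or files on its own.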