WebProxy Architecture
Overview
WebProxy is a self-hosted web indexing proxy that crawls, caches, and serves internet content to devices on your local network.
Recommended Tech Stack
Based on a survey of current open-source tools, the following stack is recommended for a TypeScript/Node.js monorepo:
Tier 1: Core (Native TypeScript)
| Component | Tool | npm Package | Role |
|---|---|---|---|
| Crawler | Crawlee | crawlee, @crawlee/playwright | Crawl/index topics of interest |
| Forward Proxy | http-proxy-3 | http-proxy-3 | Transparent caching proxy for network devices |
| MITM Proxy | mockttp | mockttp | HTTPS interception, request/response capture |
| WARC Storage | warcio.js | warcio | Write/read WARC archives of cached content |
| Search | MeiliSearch | meilisearch (client) | Full-text search of indexed content |
| Dashboard | Next.js | next | Web dashboard for management |
Tier 2: Utility Libraries
| Component | Package | Role |
|---|---|---|
| Content extraction | @mozilla/readability | Extract article content from HTML |
| HTML to Markdown | turndown | Convert HTML to Markdown for AI/LLM use |
| DOM parsing | cheerio | Server-side HTML parsing |
| Caching layer | keyv or custom LRU | HTTP response caching with TTL |
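To make the "custom LRU" option for the caching layer concrete, here is a minimal sketch of an LRU cache with a per-entry TTL. The class name `TtlLruCache`, its constructor parameters, and the injectable clock are assumptions made for illustration; nothing here is taken from the project code or from keyv.

```typescript
// Hypothetical helper: a small LRU cache with per-entry TTL.
// A Map keeps insertion order, so its first key is the least recently used.
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
}

class TtlLruCache<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be exercised without real waiting.
  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry (the oldest key in the Map).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A cache like this could sit in front of WARC storage to avoid disk reads for hot URLs. Expired entries are only removed when they are read, so a production version would probably also need a periodic sweep.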
Tier 3: Optional Sidecars (Docker)
| Component | Tool | Role |
|---|---|---|
| Replay engine | pywb | Serve archived WARC content Wayback-style |
| Deep archiving | ArchiveBox | Comprehensive page archiving |
Why These Choices
Crawlee over Scrapy/Nutch/Crawl4AI
- Native TypeScript - single language across the monorepo
- Supports Puppeteer, Playwright, Cheerio, and raw HTTP crawlers
- Built-in request queue, proxy rotation, autoscaling
- Actively maintained by Apify (monthly releases)
http-proxy-3 + mockttp over Squid/mitmproxy
- Pure Node.js - no external binary management
- `http-proxy-3` is a modern rewrite that fixes socket leaks and adds partial HTTP/2 support
- `mockttp` provides full MITM capabilities natively in TypeScript
MeiliSearch over OpenSearch/YaCy/Solr
- Rust binary - lightweight, runs on Raspberry Pi (~1GB RAM)
- Official Node.js SDK with TypeScript types
- Auto language detection, typo tolerance
- Single binary deployment
warcio.js over warcprox
- Native TypeScript (v2.0+) by the Webrecorder team
- Streaming WARC read/write for both browser and Node.js
- No Python dependency
Data Flow
Network Device                        Internet
      |                                   |
      |  HTTP(S) request                  |
      v                                   |
   [proxy]                                |
      |                                   |
      |--> Cache hit? --> [storage] --> Serve from WARC
      |                                   |
      |--> Cache miss? --------------> Fetch from Internet
      |         |                         |
      |         +--> [storage] ------> Write to WARC
      |         +--> [indexer] ------> Index in MeiliSearch
      |                                   |
      v                                   |
Response to Device                        |
                                          |
[crawler] (scheduled) ------------> Crawl topics
      |                                   |
      +--> [storage] ----------------> Write to WARC
      +--> [indexer] ----------------> Index in MeiliSearch
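The cache hit/miss path in the diagram can be sketched as a single request handler. Every name below (`CapturedResponse`, `ArchiveStore`, `Indexer`, `Fetcher`, `handleRequest`, and the in-memory stand-ins) is hypothetical; in the real system the store would presumably be backed by WARC files and the indexer by MeiliSearch, neither of which is used here.

```typescript
// Hypothetical types: a simplified captured HTTP response.
interface CapturedResponse {
  url: string;
  status: number;
  body: string;
}

// Stand-in for the [storage] box (assumed to be WARC-backed in practice).
interface ArchiveStore {
  read(url: string): Promise<CapturedResponse | undefined>;
  write(response: CapturedResponse): Promise<void>;
}

// Stand-in for the [indexer] box (assumed to be MeiliSearch-backed in practice).
interface Indexer {
  index(response: CapturedResponse): Promise<void>;
}

type Fetcher = (url: string) => Promise<CapturedResponse>;

async function handleRequest(
  url: string,
  store: ArchiveStore,
  indexer: Indexer,
  fetchFromInternet: Fetcher,
): Promise<{ response: CapturedResponse; cacheHit: boolean }> {
  // Cache hit: serve straight from storage.
  const cached = await store.read(url);
  if (cached !== undefined) return { response: cached, cacheHit: true };

  // Cache miss: fetch, then archive and index before responding.
  const fresh = await fetchFromInternet(url);
  await store.write(fresh);
  await indexer.index(fresh);
  return { response: fresh, cacheHit: false };
}

// In-memory implementations, for illustration and testing only.
class MemoryArchiveStore implements ArchiveStore {
  private readonly records = new Map<string, CapturedResponse>();
  async read(url: string): Promise<CapturedResponse | undefined> {
    return this.records.get(url);
  }
  async write(response: CapturedResponse): Promise<void> {
    this.records.set(response.url, response);
  }
}

class MemoryIndexer implements Indexer {
  readonly indexedUrls: string[] = [];
  async index(response: CapturedResponse): Promise<void> {
    this.indexedUrls.push(response.url);
  }
}
```

This sketch awaits the archive and index writes before returning, which keeps it simple but delays the response to the device. The diagram does not say whether those writes block the response, so running them in the background is an equally plausible design.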
Package Structure
packages/
  core/      # http-proxy-3 + mockttp based forward proxy
  indexer/   # Crawlee-based topic crawler + MeiliSearch indexing
  shared/    # Shared types, utilities, config schemas
apps/
  web/       # Next.js landing page & admin dashboard
  proxy/     # Main proxy server entry point
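As an illustration of what a config schema in `shared/` might look like, the sketch below defines a config type and a hand-written validator. The field names (`proxyPort`, `meiliHost`, `warcDir`, `topics`) and the `parseConfig` function are assumptions, not the package's actual exports.

```typescript
// Hypothetical config shape that apps/proxy and packages/indexer could share.
interface WebProxyConfig {
  proxyPort: number; // port the forward proxy listens on
  meiliHost: string; // base URL of the MeiliSearch instance
  warcDir: string;   // directory where WARC archives are written
  topics: string[];  // topics of interest for the scheduled crawler
}

// Validate untrusted input (e.g. parsed JSON) and return a typed config.
function parseConfig(input: unknown): WebProxyConfig {
  if (typeof input !== "object" || input === null) {
    throw new Error("config must be an object");
  }
  const { proxyPort, meiliHost, warcDir, topics } = input as Record<string, unknown>;

  if (typeof proxyPort !== "number" || !Number.isInteger(proxyPort) || proxyPort < 1 || proxyPort > 65535) {
    throw new Error("proxyPort must be an integer between 1 and 65535");
  }
  if (typeof meiliHost !== "string" || meiliHost.length === 0) {
    throw new Error("meiliHost must be a non-empty string");
  }
  if (typeof warcDir !== "string" || warcDir.length === 0) {
    throw new Error("warcDir must be a non-empty string");
  }
  if (!Array.isArray(topics) || !topics.every((t) => typeof t === "string")) {
    throw new Error("topics must be an array of strings");
  }
  return { proxyPort, meiliHost, warcDir, topics };
}
```

Keeping validation in one shared package means the proxy and the crawler would fail the same way on a malformed config. A schema library could replace the manual checks if the project adopts one.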