# WebProxy Architecture

## Overview

WebProxy is a self-hosted web indexing proxy that crawls, caches, and serves internet content to devices on your local network.

## Recommended Tech Stack

The following stack is recommended for a TypeScript/Node.js monorepo, based on a survey of current open source tools:

### Tier 1: Core (Native TypeScript)

| Component | Tool | npm Package | Role |
|-----------|------|-------------|------|
| Crawler | Crawlee | `crawlee`, `@crawlee/playwright` | Crawl/index topics of interest |
| Forward Proxy | http-proxy-3 | `http-proxy-3` | Transparent caching proxy for network devices |
| MITM Proxy | mockttp | `mockttp` | HTTPS interception, request/response capture |
| WARC Storage | warcio.js | `warcio` | Write/read WARC archives of cached content |
| Search | MeiliSearch | `meilisearch` (client) | Full-text search of indexed content |
| Dashboard | Next.js | `next` | Web dashboard for management |

### Tier 2: Utility Libraries

| Component | Package | Role |
|-----------|---------|------|
| Content extraction | `@mozilla/readability` | Extract article content from HTML |
| HTML to Markdown | `turndown` | Convert HTML to Markdown for AI/LLM use |
| DOM parsing | `cheerio` | Server-side HTML parsing |
| Caching layer | `keyv` or custom LRU | HTTP response caching with TTL |

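
If the "custom LRU" option from the caching row is chosen over `keyv`, it could look roughly like the sketch below. This is a minimal illustration, not the project's implementation: the class name `TtlLruCache`, its constructor parameters, and the injectable clock are all assumptions made for the example.

```typescript
// Hypothetical helper: a tiny LRU cache with a fixed TTL per entry.
// TtlLruCache, maxEntries, ttlMs and now are illustrative names, not project code.
interface Entry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
}

export class TtlLruCache<V> {
  // Map preserves insertion order, so the first key is the least recently used.
  private readonly entries = new Map<string, Entry<V>>();

  constructor(
    private readonly maxEntries: number,
    private readonly ttlMs: number,
    private readonly now: () => number = Date.now, // injectable clock for tests
  ) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert to mark the key as most recently used (TTL is not refreshed).
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry (first key in insertion order).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Reads do not extend an entry's lifetime here, so stale HTTP responses age out even when they are requested often. A real cache would probably also want to respect `Cache-Control` headers instead of a single fixed TTL.
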
### Tier 3: Optional Sidecars (Docker)

| Component | Tool | Role |
|-----------|------|------|
| Replay engine | pywb | Serve archived WARC content Wayback-style |
| Deep archiving | ArchiveBox | Comprehensive page archiving |

## Why These Choices

### Crawlee over Scrapy/Nutch/Crawl4AI
- **Native TypeScript** - single language across the monorepo
- Supports Puppeteer, Playwright, Cheerio, and raw HTTP crawlers
- Built-in request queue, proxy rotation, autoscaling
- Actively maintained by Apify (monthly releases)

### http-proxy-3 + mockttp over Squid/mitmproxy
- Pure Node.js - no external binary management
- `http-proxy-3` is a modern rewrite of `http-proxy` that fixes socket leaks and adds partial HTTP/2 support
- `mockttp` provides full MITM capabilities natively in TypeScript

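
To show why the forward proxy and the MITM proxy are separate pieces, the sketch below classifies an incoming proxy request by its request target. Plain HTTP reaches a forward proxy in absolute form (`http://host/path`), while HTTPS arrives as a `CONNECT host:port` tunnel, which is the case where interception would be needed. The helper `parseProxyTarget` and the `ProxyTarget` type are hypothetical and belong to neither `http-proxy-3` nor `mockttp`.

```typescript
// Hypothetical helper: names are illustrative, not part of http-proxy-3 or mockttp.
interface ProxyTarget {
  host: string;
  port: number;
  tunnel: boolean; // true for CONNECT (HTTPS), where MITM interception would apply
}

export function parseProxyTarget(method: string, requestTarget: string): ProxyTarget {
  if (method.toUpperCase() === "CONNECT") {
    // Authority form, e.g. "example.com:443".
    const idx = requestTarget.lastIndexOf(":");
    if (idx <= 0) {
      throw new Error(`CONNECT target must be host:port, got "${requestTarget}"`);
    }
    const host = requestTarget.slice(0, idx);
    const port = Number(requestTarget.slice(idx + 1));
    if (!Number.isInteger(port) || port < 1 || port > 65535) {
      throw new Error(`Invalid port in CONNECT target "${requestTarget}"`);
    }
    return { host, port, tunnel: true };
  }

  // Absolute form, e.g. "http://example.com/path".
  // new URL() throws on origin-form targets such as "/path", which a forward
  // proxy should not receive.
  const url = new URL(requestTarget);
  const defaultPort = url.protocol === "https:" ? 443 : 80;
  const port = url.port !== "" ? Number(url.port) : defaultPort;
  return { host: url.hostname, port, tunnel: false };
}
```

In this reading, requests with `tunnel: false` could be served or cached directly by the forward proxy, while `tunnel: true` requests would be handed to the MITM layer or passed through untouched.
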
### MeiliSearch over OpenSearch/YaCy/Solr
- Rust binary - lightweight, runs on a Raspberry Pi (~1GB RAM)
- Official Node.js SDK with TypeScript types
- Automatic language detection and typo tolerance
- Single binary deployment

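
As a rough idea of what indexing a page might involve, the sketch below maps a crawled page to a search document. The `CrawledPage` and `SearchDocument` shapes and the function `toSearchDocument` are assumptions for illustration only. The one constraint taken from MeiliSearch itself is that document ids may contain only alphanumeric characters, hyphens, and underscores, so a raw URL cannot serve as the primary key.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes: neither CrawledPage nor SearchDocument comes from the project.
interface CrawledPage {
  url: string;
  title: string;
  text: string; // extracted article text
  fetchedAt: Date;
}

interface SearchDocument {
  id: string;
  url: string;
  title: string;
  content: string;
  fetchedAt: number; // Unix seconds, convenient for sorting and filtering
}

// A hex SHA-256 digest of the URL is a stable id made only of [0-9a-f], so it
// satisfies the id rules, and re-indexing the same URL replaces the old document.
export function toSearchDocument(page: CrawledPage): SearchDocument {
  const id = createHash("sha256").update(page.url).digest("hex");
  return {
    id,
    url: page.url,
    title: page.title,
    content: page.text,
    fetchedAt: Math.floor(page.fetchedAt.getTime() / 1000),
  };
}
```

The resulting objects could then be passed to the official client's `addDocuments` call on an index; the index name and settings are left open here.
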
### warcio.js over warcprox
- Native TypeScript (v2.0+) by the Webrecorder team
- Streaming WARC read/write for both browser and Node.js
- No Python dependency

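
For readers unfamiliar with the format, the sketch below hand-builds a single WARC/1.1 `response` record to show what ends up on disk. It deliberately does not use warcio.js, and `buildWarcResponseRecord` is a hypothetical helper, not that library's API; in the actual project the library would handle this, along with compression and streaming.

```typescript
import { randomUUID } from "node:crypto";

const CRLF = "\r\n";

// Hypothetical helper, not warcio's API: builds one WARC/1.1 "response" record
// from the raw HTTP response bytes (status line + headers + body).
export function buildWarcResponseRecord(
  targetUri: string,
  httpResponse: Uint8Array,
  date: Date = new Date(),
  recordId: string = randomUUID(),
): Uint8Array {
  const header =
    [
      "WARC/1.1",
      "WARC-Type: response",
      `WARC-Target-URI: ${targetUri}`,
      // Second precision, UTC, e.g. 2024-01-01T00:00:00Z
      `WARC-Date: ${date.toISOString().replace(/\.\d{3}Z$/, "Z")}`,
      `WARC-Record-ID: <urn:uuid:${recordId}>`,
      "Content-Type: application/http; msgtype=response",
      `Content-Length: ${httpResponse.byteLength}`, // length of the block in bytes
    ].join(CRLF) + CRLF + CRLF; // a blank line ends the WARC header

  const encoder = new TextEncoder();
  const head = encoder.encode(header);
  const tail = encoder.encode(CRLF + CRLF); // every record ends with two CRLFs

  const record = new Uint8Array(head.length + httpResponse.byteLength + tail.length);
  record.set(head, 0);
  record.set(httpResponse, head.length);
  record.set(tail, head.length + httpResponse.byteLength);
  return record;
}
```

A WARC file is a concatenation of such records, which is why streaming writes fit the format well.
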
## Data Flow

```
Network Device                      Internet
     |                                  |
     | HTTP(S) request                  |
     v                                  |
  [proxy]                               |
     |                                  |
     |--> Cache hit? --> [storage] --> Serve from WARC
     |                                  |
     |--> Cache miss? --------------> Fetch from Internet
     |                       |          |
     |                       +--> [storage] ------> Write to WARC
     |                       +--> [indexer] ------> Index in MeiliSearch
     |                                  |
     v                                  |
Response to Device                      |
                                        |
[crawler] (scheduled) ------------> Crawl topics
     |                                  |
     +--> [storage] ----------------> Write to WARC
     +--> [indexer] ----------------> Index in MeiliSearch
```

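
The proxy path in the diagram can be sketched as a single function. Everything here is a hypothetical stand-in: `PageStore`, `Indexer`, `Fetcher`, and `handleRequest` are illustrative names for the `[storage]`, `[indexer]`, and `[proxy]` boxes, and the real packages may be structured quite differently.

```typescript
// All names below are hypothetical stand-ins for the boxes in the diagram.
interface CachedResponse {
  status: number;
  body: string;
}

interface PageStore {
  read(url: string): Promise<CachedResponse | undefined>; // look up in the WARC store
  write(url: string, response: CachedResponse): Promise<void>; // append to a WARC
}

interface Indexer {
  index(url: string, response: CachedResponse): Promise<void>; // push to search
}

type Fetcher = (url: string) => Promise<CachedResponse>;

export async function handleRequest(
  url: string,
  store: PageStore,
  indexer: Indexer,
  fetchFromInternet: Fetcher,
): Promise<{ response: CachedResponse; cacheHit: boolean }> {
  // Cache hit: serve the archived copy without touching the network.
  const cached = await store.read(url);
  if (cached !== undefined) {
    return { response: cached, cacheHit: true };
  }

  // Cache miss: fetch, then archive and index the fresh response.
  const fresh = await fetchFromInternet(url);
  await Promise.all([store.write(url, fresh), indexer.index(url, fresh)]);
  return { response: fresh, cacheHit: false };
}
```

The scheduled crawler would reuse the same `write` and `index` steps, only driven by a topic list instead of a device request. Waiting for both writes before responding keeps the sketch simple; a real proxy might respond first and archive in the background.
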
## Package Structure

```
packages/
  core/       # http-proxy-3 + mockttp based forward proxy
  indexer/    # Crawlee-based topic crawler + MeiliSearch indexing
  shared/     # Shared types, utilities, config schemas

apps/
  web/        # Next.js landing page & admin dashboard
  proxy/      # Main proxy server entry point
```