From 002e1d70a604d4bd634be19d5b2952c8ca81652f Mon Sep 17 00:00:00 2001 From: cottongin Date: Fri, 20 Mar 2026 10:42:33 -0400 Subject: [PATCH] Add design doc for ecast shard monitor replacing Puppeteer audience approach Replaces room-monitor.js (REST polling) and player-count-checker.js (Puppeteer/CDP audience join) with a single EcastShardClient that connects as a shard via direct WebSocket. Defines new event contract, integration points, error handling, and reconnection strategy. Made-with: Cursor --- docs/plans/2026-03-20-shard-monitor-design.md | 345 ++++++++++++++++++ 1 file changed, 345 insertions(+) create mode 100644 docs/plans/2026-03-20-shard-monitor-design.md diff --git a/docs/plans/2026-03-20-shard-monitor-design.md b/docs/plans/2026-03-20-shard-monitor-design.md new file mode 100644 index 0000000..469686b --- /dev/null +++ b/docs/plans/2026-03-20-shard-monitor-design.md @@ -0,0 +1,345 @@ +# Ecast Shard Monitor — Design Document + +**Date:** 2026-03-20 +**Status:** Approved +**Replaces:** `room-monitor.js` (REST polling for lock) + `player-count-checker.js` (Puppeteer audience join) + +## Problem + +The current player count approach launches a headless Chrome instance via Puppeteer, navigates to `jackbox.tv`, joins as an audience member through the UI, and sniffs WebSocket frames via CDP. This is fragile, resource-heavy, and occupies an audience slot. The room monitor is a separate module that polls the REST API until the room locks, then hands off to the Puppeteer checker. Two modules, two connection strategies, a circular dependency workaround. + +## Solution + +Replace both modules with a single `EcastShardClient` that connects to the Jackbox ecast server as a **shard** via a direct Node.js WebSocket. 
The shard role: + +- Gets the full `here` map (authoritative player list with names and roles) +- Receives real-time entity updates (room state, player joins, game end) +- Can query entities via `object/get` +- Does NOT count toward `maxPlayers` or trigger `full: true` +- Does NOT require a browser + +One REST call upfront validates the room and retrieves the `host` field needed for the WebSocket URL. After that, the shard connection handles everything. + +## Architecture + +### Lifecycle + +``` +Room code registered + │ + ▼ + REST: GET /rooms/{code} ──── 404 ──→ Mark failed, stop + │ + │ (get host, maxPlayers, locked, appTag) + ▼ + WSS: Connect as shard + wss://{host}/api/v2/rooms/{code}/play?role=shard&name=GamePicker&userId=gamepicker-{sessionId}&format=json + │ + ▼ + client/welcome received + ├── Parse `here` → initial player count (filter for `player` roles) + ├── Parse `entities.room` → lobby state, gameCanStart, etc. + ├── Store `secret` + `id` for reconnection + └── Broadcast initial state to our clients + │ + ▼ + ┌─── Event loop (listening for server messages) ───┐ + │ │ + │ `object` (key: textDescriptions) │ + │ → Parse latestDescriptions for player joins │ + │ → Broadcast `lobby.player-joined` to clients │ + │ │ + │ `object` (key: room) │ + │ → Detect state transitions: │ + │ lobbyState changes → broadcast lobby updates │ + │ state: "Gameplay" → broadcast `game.started` │ + │ gameFinished: true → broadcast `game.ended` │ + │ gameResults → extract final player count │ + │ │ + │ `client/connected` (if delivered to shards) │ + │ → Update here map, recount players │ + │ │ + │ WebSocket close/error │ + │ → REST check: room exists? 
│ + │ Yes → reconnect with secret/id │ + │ No → game ended, finalize │ + └────────────────────────────────────────────────────┘ +``` + +### Internal State + +| Field | Type | Source | +|-------|------|--------| +| `playerCount` | number | `here` map filtered for `player` roles | +| `playerNames` | string[] | `here` map player role `name` fields | +| `lobbyState` | string | `room` entity `lobbyState` | +| `gameState` | string | `room` entity `state` (`"Lobby"`, `"Gameplay"`) | +| `gameStarted` | boolean | Derived from `state === "Gameplay"` | +| `gameFinished` | boolean | `room` entity `gameFinished` | +| `maxPlayers` | number | REST response + `room` entity | +| `secret` / `id` | string/number | `client/welcome` for reconnection | + +### Player Counting + +The `here` map from `client/welcome` is the authoritative source. It lists all registered connections with their roles. Count entries where `roles` contains `player`. The shard itself is excluded (it has `roles: {shard: {}}`). The host (ID 1, `roles: {host: {}}`) is also excluded. Since Jackbox holds slots for disconnected players, `here` always reflects the true occupied slot count. + +For subsequent joins after connect, `textDescriptions` entity updates provide join notifications. Since shards have `here` visibility, `client/connected` messages may also be delivered — both paths are handled, with `here` as source of truth. + +## WebSocket Events (Game Picker → Connected Clients) + +### `room.connected` + +Shard successfully connected to the Jackbox room. Sent once on initial connect. Replaces the old `audience.joined` event. + +```json +{ + "type": "room.connected", + "timestamp": "...", + "data": { + "sessionId": 1, + "gameId": 5, + "roomCode": "LSBN", + "appTag": "drawful2international", + "maxPlayers": 8, + "playerCount": 2, + "players": ["Alice", "Bob"], + "lobbyState": "CanStart", + "gameState": "Lobby" + } +} +``` + +### `lobby.player-joined` + +A new player joined the lobby. 
+ +```json +{ + "type": "lobby.player-joined", + "timestamp": "...", + "data": { + "sessionId": 1, + "gameId": 5, + "roomCode": "LSBN", + "playerName": "Charlie", + "playerCount": 3, + "players": ["Alice", "Bob", "Charlie"], + "maxPlayers": 8 + } +} +``` + +### `lobby.updated` + +Lobby state changed (enough players to start, countdown started, etc.). + +```json +{ + "type": "lobby.updated", + "timestamp": "...", + "data": { + "sessionId": 1, + "gameId": 5, + "roomCode": "LSBN", + "lobbyState": "Countdown", + "gameCanStart": true, + "gameIsStarting": true, + "playerCount": 4 + } +} +``` + +### `game.started` + +The game transitioned from Lobby to Gameplay. + +```json +{ + "type": "game.started", + "timestamp": "...", + "data": { + "sessionId": 1, + "gameId": 5, + "roomCode": "LSBN", + "playerCount": 4, + "players": ["Alice", "Bob", "Charlie", "Diana"], + "maxPlayers": 8 + } +} +``` + +### `game.ended` + +The game finished. + +```json +{ + "type": "game.ended", + "timestamp": "...", + "data": { + "sessionId": 1, + "gameId": 5, + "roomCode": "LSBN", + "playerCount": 4, + "players": ["Alice", "Bob", "Charlie", "Diana"] + } +} +``` + +### `room.disconnected` + +Shard lost connection to the Jackbox room. + +```json +{ + "type": "room.disconnected", + "timestamp": "...", + "data": { + "sessionId": 1, + "gameId": 5, + "roomCode": "LSBN", + "reason": "room_closed", + "finalPlayerCount": 4 + } +} +``` + +Possible `reason` values: `room_closed`, `room_not_found`, `connection_failed`, `role_rejected`, `manually_stopped`. + +### Dropped Events + +| Old event | Replacement | +|-----------|-------------| +| `audience.joined` | `room.connected` (richer payload) | +| `player-count.updated` (automated) | `lobby.player-joined`, `game.started`, `game.ended` carry `playerCount` | + +The manual `PATCH .../player-count` endpoint keeps broadcasting `player-count.updated` for its specific use case. 
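
The payloads in this section share one envelope (`type`, `timestamp`, `data`), and their `playerCount` / `players` fields come from the `here` map described under Player Counting. Below is a minimal sketch of how such an event might be assembled. It assumes each `here` entry looks like `{ roles: { player: { name } } }` and that `timestamp` is an ISO 8601 string (the examples above show only `"..."`). The helper names `playersFromHere`, `buildEvent`, and `buildPlayerJoined` are hypothetical and not part of the design.

```javascript
// Hypothetical helper: list player names from an ecast `here` map.
// Assumed entry shape: { roles: { player: { name: "Alice" } } }.
// Host ({ host: {} }) and shard ({ shard: {} }) entries are skipped
// because they carry no `player` role.
function playersFromHere(here) {
  return Object.values(here || {})
    .filter((conn) => conn && conn.roles && conn.roles.player)
    .map((conn) => conn.roles.player.name);
}

// Wrap a payload in the { type, timestamp, data } envelope.
// ISO 8601 timestamps are an assumption of this sketch.
function buildEvent(type, context, extra = {}) {
  return {
    type,
    timestamp: new Date().toISOString(),
    data: { ...context, ...extra },
  };
}

// Build a `lobby.player-joined` event, recounting from `here` so the
// map stays the source of truth for playerCount and players.
function buildPlayerJoined(context, here, playerName, maxPlayers) {
  const players = playersFromHere(here);
  return buildEvent('lobby.player-joined', context, {
    playerName,
    playerCount: players.length,
    players,
    maxPlayers,
  });
}
```

Recounting from `here` on every event, rather than incrementing a counter, keeps the broadcast numbers consistent even if a join notification arrives through both `textDescriptions` and `client/connected`.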
+ +### DB Persistence + +The `session_games` table columns `player_count` and `player_count_check_status` continue to be updated: + +- `player_count` — updated on each join and at game end +- `player_count_check_status` — `'monitoring'` (shard connected), `'completed'` (game ended with count), `'failed'` (couldn't connect), `'stopped'` (manual stop) + +The old `'checking'` status becomes `'monitoring'`. + +## Integration Points + +### Files Deleted + +- `backend/utils/player-count-checker.js` — Puppeteer audience approach +- `backend/utils/room-monitor.js` — REST polling for lock state + +### Files Created + +- `backend/utils/ecast-shard-client.js` — `EcastShardClient` class + module exports: `startMonitor`, `stopMonitor`, `cleanupAllShards` + +### Files Modified + +**`backend/utils/jackbox-api.js`** — Add `getRoomInfo(roomCode)` returning the full room response including `host`, `appTag`, `audienceEnabled`. + +**`backend/routes/sessions.js`** — Replace imports: + +```javascript +// Old +const { stopPlayerCountCheck } = require('../utils/player-count-checker'); +const { startRoomMonitor, stopRoomMonitor } = require('../utils/room-monitor'); + +// New +const { startMonitor, stopMonitor } = require('../utils/ecast-shard-client'); +``` + +All call sites change from two-function calls to one: + +| Route | Old | New | +|-------|-----|-----| +| `POST /:id/games` (with room_code) | `startRoomMonitor(...)` | `startMonitor(...)` | +| `PATCH .../status` (away from playing) | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` | +| `DELETE .../games/:gameId` | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` | +| `POST .../start-player-check` | `startRoomMonitor(...)` | `startMonitor(...)` | +| `POST .../stop-player-check` | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` | + +Endpoint paths stay the same for backwards compatibility. 
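
The table above collapses each pair of stop calls into a single `stopMonitor(...)`. One way the new module could support that is a small registry holding one shard client per game, sketched below. The argument lists are elided in this document, so the `(sessionId, gameId, roomCode)` parameters, the key format, and the injected `createClient` factory are all assumptions made for illustration only.

```javascript
// Hypothetical registry backing startMonitor / stopMonitor / cleanupAllShards.
// `createClient` is an injected factory (an assumption of this sketch) that
// returns an object with a close() method, standing in for EcastShardClient.
function createMonitorRegistry(createClient) {
  const clients = new Map();
  // Illustrative key format; the design doc does not specify one.
  const keyFor = (sessionId, gameId) => `${sessionId}:${gameId}`;

  function startMonitor(sessionId, gameId, roomCode) {
    const key = keyFor(sessionId, gameId);
    // Close any existing monitor for this game rather than stacking two.
    if (clients.has(key)) clients.get(key).close();
    const client = createClient({ sessionId, gameId, roomCode });
    clients.set(key, client);
    return client;
  }

  function stopMonitor(sessionId, gameId) {
    const key = keyFor(sessionId, gameId);
    const client = clients.get(key);
    if (!client) return false; // nothing was being monitored
    client.close();
    clients.delete(key);
    return true;
  }

  function cleanupAllShards() {
    for (const client of clients.values()) client.close();
    clients.clear();
  }

  return { startMonitor, stopMonitor, cleanupAllShards, size: () => clients.size };
}
```

Because there is only one connection per game, every route that previously had to stop both the room monitor and the player count checker needs a single call.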
+ +**`backend/server.js`** — Wire `cleanupAllShards()` into `SIGTERM`/`SIGINT` handlers. + +## Error Handling and Reconnection + +### Connection Failures + +1. **REST validation fails** (room not found, network error): Set status `'failed'`, broadcast `room.disconnected` with `reason: 'room_not_found'` or `'connection_failed'`. No automatic retry. + +2. **Shard WebSocket fails to connect**: Retry up to 3 times with exponential backoff (2s, 4s, 8s). On exhaustion, set status `'failed'`, broadcast `room.disconnected` with `reason: 'connection_failed'`. + +3. **Ecast rejects the shard role** (error opcode received): Set status `'failed'`, broadcast `room.disconnected` with `reason: 'role_rejected'`. No retry. + +### Mid-Session Disconnections + +4. **WebSocket closes unexpectedly**: REST check `GET /rooms/{code}`: + - Room exists → reconnect with stored `secret`/`id` (up to 3 attempts, exponential backoff). Transparent to clients on success. + - Room gone → finalize with last known count, status `'completed'`, broadcast `game.ended` + `room.disconnected`. + +5. **Ecast error 2027 "room already closed"**: Same as room-gone path. + +### Manual Stop + +6. **`stop-player-check` called or game status changes**: Close WebSocket gracefully, set status `'stopped'` (unless already `'completed'`), broadcast `room.disconnected` with `reason: 'manually_stopped'`. + +### Server Shutdown + +7. **`SIGTERM`/`SIGINT`**: `cleanupAllShards()` closes all WebSocket connections. No DB updates on shutdown. 
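
The retry and mid-session rules above (three attempts at 2s, 4s, 8s, then either reconnect or finalize depending on the REST check) can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: `retryWithBackoff`, `handleUnexpectedClose`, and the injected `roomExists`, `reconnect`, `finalize`, and `sleep` callbacks are hypothetical names, and `finalize` is assumed to take the resulting status.

```javascript
// Delay schedule for retries: 2s, 4s, 8s with the defaults.
function backoffDelays(maxAttempts = 3, baseMs = 2000) {
  return Array.from({ length: maxAttempts }, (_, i) => baseMs * 2 ** i);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `attempt` after each backoff delay until it succeeds or the schedule
// is exhausted. Intended to be called after an initial failure or a drop.
async function retryWithBackoff(
  attempt,
  { maxAttempts = 3, baseMs = 2000, sleep = defaultSleep } = {}
) {
  let lastError;
  for (const delay of backoffDelays(maxAttempts, baseMs)) {
    await sleep(delay);
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
    }
  }
  throw new Error('retry attempts exhausted', { cause: lastError });
}

// Mid-session drop: check the room over REST first, then either reconnect
// or finalize. All callbacks are injected (hypothetical) collaborators.
async function handleUnexpectedClose({ roomExists, reconnect, finalize, sleep }) {
  if (!(await roomExists())) return finalize('completed'); // room gone
  try {
    return await retryWithBackoff(reconnect, { sleep });
  } catch (err) {
    return finalize('failed'); // reconnect exhausted
  }
}
```

Injecting `sleep` keeps the schedule testable without real timers; with the defaults the three waits add up to 14 seconds before the monitor gives up.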
+ +### State Machine + +``` + startMonitor() + │ + ▼ + ┌───────────┐ + ┌────────│ not_started│ + │ └───────────┘ + │ │ + REST fails REST succeeds + │ │ + ▼ ▼ + ┌────────┐ ┌────────────┐ + │ failed │ │ monitoring │◄──── reconnect success + └────────┘ └─────┬──────┘ + ▲ │ + │ ┌────┴─────┬──────────────┐ + reconnect │ │ │ + exhausted game ends WS drops manual stop + │ │ │ │ + │ ▼ ▼ ▼ + │ ┌──────────┐ REST check ┌─────────┐ + │ │ completed │ │ │ stopped │ + │ └──────────┘ │ └─────────┘ + │ │ + └──── room gone? ────┘ + │ + room exists? + │ + reconnect... +``` + +### Timeouts + +| Concern | Value | Rationale | +|---------|-------|-----------| +| WebSocket connect timeout | 10s | Ecast servers respond fast | +| Reconnect backoff | 2s, 4s, 8s | Three attempts, ~14s total | +| Max reconnect attempts | 3 | Fail fast, user can retry manually | +| WebSocket inactivity timeout | None | Shard connections receive periodic `shard/sync` CRDT messages | + +## Dependencies + +**Added:** `ws` (Node.js WebSocket library) — already a dependency (used by `websocket-manager.js`). + +**Removed:** `puppeteer` — no longer needed for room monitoring. + +## Non-Goals + +- Renaming REST endpoint paths (`start-player-check` / `stop-player-check`) — kept for backwards compatibility +- Auto-starting monitoring when room code is set via `PATCH .../room-code` — kept as manual trigger only +- Frontend `Picker.jsx` changes — tracked separately (existing bugs: `message.event` vs `message.type`, subscribe without auth, `'waiting'` status that's never set)