# Ecast Shard Monitor: Design Document

**Date:** 2026-03-20
**Status:** Approved
**Replaces:** `room-monitor.js` (REST polling for lock) + `player-count-checker.js` (Puppeteer audience join)

## Problem

The current player count approach launches a headless Chrome instance via Puppeteer, navigates to `jackbox.tv`, joins as an audience member through the UI, and sniffs WebSocket frames via CDP. This is fragile, resource-heavy, and occupies an audience slot.

The room monitor is a separate module that polls the REST API until the room locks, then hands off to the Puppeteer checker. Two modules, two connection strategies, a circular dependency workaround.

## Solution

Replace both modules with a single `EcastShardClient` that connects to the Jackbox ecast server as a **shard** via a direct Node.js WebSocket. The shard role:

- Gets the full `here` map (authoritative player list with names and roles)
- Receives real-time entity updates (room state, player joins, game end)
- Can query entities via `object/get`
- Does NOT count toward `maxPlayers` or trigger `full: true`
- Does NOT require a browser

One REST call upfront validates the room and retrieves the `host` field needed for the WebSocket URL. After that, the shard connection handles everything.

## Architecture

### Lifecycle

```
Room code registered
        │
        ▼
REST: GET /rooms/{code} ──── 404 ──→ Mark failed, stop
        │
        │  (get host, maxPlayers, locked, appTag)
        ▼
WSS: Connect as shard
    wss://{host}/api/v2/rooms/{code}/play?role=shard&name=GamePicker&userId=gamepicker-{sessionId}&format=json
        │
        ▼
client/welcome received
        ├── Parse `here` → initial player count (filter for `player` roles)
        ├── Parse `entities.room` → lobby state, gameCanStart, etc.
        ├── Store `secret` + `id` for reconnection
        └── Broadcast initial state to our clients
        │
        ▼
┌─── Event loop (listening for server messages) ─────┐
│                                                    │
│  `object` (key: textDescriptions)                  │
│    → Parse latestDescriptions for player joins     │
│    → Broadcast `lobby.player-joined` to clients    │
│                                                    │
│  `object` (key: room)                              │
│    → Detect state transitions:                     │
│      lobbyState changes → broadcast lobby updates  │
│      state: "Gameplay" → broadcast `game.started`  │
│      gameFinished: true → broadcast `game.ended`   │
│      gameResults → extract final player count      │
│                                                    │
│  `client/connected` (if delivered to shards)       │
│    → Update here map, recount players              │
│                                                    │
│  WebSocket close/error                             │
│    → REST check: room exists?                      │
│      Yes → reconnect with secret/id                │
│      No → game ended, finalize                     │
└────────────────────────────────────────────────────┘
```

### Internal State

| Field | Type | Source |
|-------|------|--------|
| `playerCount` | number | `here` map filtered for `player` roles |
| `playerNames` | string[] | `here` map player role `name` fields |
| `lobbyState` | string | `room` entity `lobbyState` |
| `gameState` | string | `room` entity `state` (`"Lobby"`, `"Gameplay"`) |
| `gameStarted` | boolean | Derived from `state === "Gameplay"` |
| `gameFinished` | boolean | `room` entity `gameFinished` |
| `maxPlayers` | number | REST response + `room` entity |
| `secret` / `id` | string/number | `client/welcome` for reconnection |

### Player Counting

The `here` map from `client/welcome` is the authoritative source. It lists all registered connections with their roles. Count entries where `roles` contains `player`. The shard itself is excluded (it has `roles: {shard: {}}`). The host (ID 1, `roles: {host: {}}`) is also excluded. Since Jackbox holds slots for disconnected players, `here` always reflects the true occupied slot count.

For subsequent joins after connect, `textDescriptions` entity updates provide join notifications.
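
The counting rule above can be sketched as a small pure function. This is a minimal illustration, not the actual `EcastShardClient` code: the name `summarizeHere` is hypothetical, and the assumed shape of a `here` entry (a `roles` object whose `player` role carries a `name` field) is inferred from the Internal State table rather than confirmed against the ecast protocol.

```javascript
// Sketch: derive playerCount and playerNames from a `here` map.
// Assumed shape (not confirmed by the design doc): keys are connection IDs,
// values look like { roles: { player: { name: 'Alice' } } }.
function summarizeHere(here) {
  const players = Object.values(here || {}).filter(
    (entry) => entry && entry.roles && 'player' in entry.roles
  );
  return {
    playerCount: players.length,
    // A player role without a name yields `undefined` for that slot.
    playerNames: players.map((entry) => entry.roles.player?.name),
  };
}

// Example input: the host (ID 1), two players, and the shard itself.
const exampleHere = {
  1: { roles: { host: {} } },
  2: { roles: { player: { name: 'Alice' } } },
  3: { roles: { player: { name: 'Bob' } } },
  4: { roles: { shard: {} } },
};

console.log(summarizeHere(exampleHere));
// { playerCount: 2, playerNames: [ 'Alice', 'Bob' ] }
```

Keeping the function pure makes it easy to re-run on every `here` update, so the count never drifts from the map it was derived from.
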
Since shards have `here` visibility, `client/connected` messages may also be delivered. Both paths are handled, with `here` as the source of truth.

## WebSocket Events (Game Picker → Connected Clients)

### `room.connected`

Shard successfully connected to the Jackbox room. Sent once on initial connect. Replaces the old `audience.joined` event.

```json
{
  "type": "room.connected",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "appTag": "drawful2international",
    "maxPlayers": 8,
    "playerCount": 2,
    "players": ["Alice", "Bob"],
    "lobbyState": "CanStart",
    "gameState": "Lobby"
  }
}
```

### `lobby.player-joined`

A new player joined the lobby.

```json
{
  "type": "lobby.player-joined",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "playerName": "Charlie",
    "playerCount": 3,
    "players": ["Alice", "Bob", "Charlie"],
    "maxPlayers": 8
  }
}
```

### `lobby.updated`

Lobby state changed (enough players to start, countdown started, etc.).

```json
{
  "type": "lobby.updated",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "lobbyState": "Countdown",
    "gameCanStart": true,
    "gameIsStarting": true,
    "playerCount": 4
  }
}
```

### `game.started`

The game transitioned from Lobby to Gameplay.

```json
{
  "type": "game.started",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "playerCount": 4,
    "players": ["Alice", "Bob", "Charlie", "Diana"],
    "maxPlayers": 8
  }
}
```

### `game.ended`

The game finished.

```json
{
  "type": "game.ended",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "playerCount": 4,
    "players": ["Alice", "Bob", "Charlie", "Diana"]
  }
}
```

### `room.disconnected`

Shard lost connection to the Jackbox room.
```json
{
  "type": "room.disconnected",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "reason": "room_closed",
    "finalPlayerCount": 4
  }
}
```

Possible `reason` values: `room_closed`, `room_not_found`, `connection_failed`, `role_rejected`, `manually_stopped`.

### Dropped Events

| Old event | Replacement |
|-----------|-------------|
| `audience.joined` | `room.connected` (richer payload) |
| `player-count.updated` (automated) | `lobby.player-joined`, `game.started`, `game.ended` carry `playerCount` |

The manual `PATCH .../player-count` endpoint keeps broadcasting `player-count.updated` for its specific use case.

### DB Persistence

The `session_games` table columns `player_count` and `player_count_check_status` continue to be updated:

- `player_count`: updated on each join and at game end
- `player_count_check_status`: `'monitoring'` (shard connected), `'completed'` (game ended with count), `'failed'` (couldn't connect), `'stopped'` (manual stop)

The old `'checking'` status becomes `'monitoring'`.

## Integration Points

### Files Deleted

- `backend/utils/player-count-checker.js`: Puppeteer audience approach
- `backend/utils/room-monitor.js`: REST polling for lock state

### Files Created

- `backend/utils/ecast-shard-client.js`: `EcastShardClient` class + module exports: `startMonitor`, `stopMonitor`, `cleanupAllShards`

### Files Modified

**`backend/utils/jackbox-api.js`**: Add `getRoomInfo(roomCode)` returning the full room response including `host`, `appTag`, `audienceEnabled`.
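
One possible shape for `getRoomInfo`, shown only as a sketch. The base URL, the `{ ok, body }` response envelope, and the injectable `fetchImpl` parameter are assumptions made for illustration; this document only specifies `GET /rooms/{code}` and the fields `host`, `appTag`, and `audienceEnabled`. `buildShardUrl` is a hypothetical helper that assembles the shard URL from the Lifecycle diagram. The sketch relies on the global `fetch` available in Node.js 18 and later.

```javascript
// Assumption: base URL of the ecast REST API. The design only names `GET /rooms/{code}`.
const ECAST_BASE_URL = 'https://ecast.jackboxgames.com/api/v2';

// Fetch room info; resolves to null when the room does not exist (HTTP 404).
// `fetchImpl` is injectable for tests and defaults to the Node.js 18+ global fetch.
async function getRoomInfo(roomCode, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl(`${ECAST_BASE_URL}/rooms/${encodeURIComponent(roomCode)}`);
  if (res.status === 404) return null;
  if (!res.ok) throw new Error(`Room lookup failed: HTTP ${res.status}`);
  const json = await res.json();
  // Assumed envelope: { ok: true, body: { host, appTag, audienceEnabled, ... } }.
  return json.body ?? json;
}

// Hypothetical helper: build the shard WebSocket URL shown in the Lifecycle diagram.
function buildShardUrl(host, roomCode, sessionId) {
  const query = new URLSearchParams({
    role: 'shard',
    name: 'GamePicker',
    userId: `gamepicker-${sessionId}`,
    format: 'json',
  });
  return `wss://${host}/api/v2/rooms/${encodeURIComponent(roomCode)}/play?${query}`;
}
```

Returning `null` for a missing room lets the caller distinguish `room_not_found` from a thrown network error, which would map to `connection_failed`.
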
**`backend/routes/sessions.js`**: Replace imports:

```javascript
// Old
const { stopPlayerCountCheck } = require('../utils/player-count-checker');
const { startRoomMonitor, stopRoomMonitor } = require('../utils/room-monitor');

// New
const { startMonitor, stopMonitor } = require('../utils/ecast-shard-client');
```

Every call site now makes a single call. Stop paths that previously called two functions collapse to one:

| Route | Old | New |
|-------|-----|-----|
| `POST /:id/games` (with room_code) | `startRoomMonitor(...)` | `startMonitor(...)` |
| `PATCH .../status` (away from playing) | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` |
| `DELETE .../games/:gameId` | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` |
| `POST .../start-player-check` | `startRoomMonitor(...)` | `startMonitor(...)` |
| `POST .../stop-player-check` | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` |

Endpoint paths stay the same for backwards compatibility.

**`backend/server.js`**: Wire `cleanupAllShards()` into `SIGTERM`/`SIGINT` handlers.

## Error Handling and Reconnection

### Connection Failures

1. **REST validation fails** (room not found, network error): Set status `'failed'`, broadcast `room.disconnected` with `reason: 'room_not_found'` or `'connection_failed'`. No automatic retry.
2. **Shard WebSocket fails to connect**: Retry up to 3 times with exponential backoff (2s, 4s, 8s). On exhaustion, set status `'failed'`, broadcast `room.disconnected` with `reason: 'connection_failed'`.
3. **Ecast rejects the shard role** (error opcode received): Set status `'failed'`, broadcast `room.disconnected` with `reason: 'role_rejected'`. No retry.

### Mid-Session Disconnections

4. **WebSocket closes unexpectedly**: REST check `GET /rooms/{code}`:
   - Room exists → reconnect with stored `secret`/`id` (up to 3 attempts, exponential backoff). Transparent to clients on success.
   - Room gone → finalize with last known count, status `'completed'`, broadcast `game.ended` + `room.disconnected`.
5. **Ecast error 2027 "room already closed"**: Same as room-gone path.

### Manual Stop

6. **`stop-player-check` called or game status changes**: Close WebSocket gracefully, set status `'stopped'` (unless already `'completed'`), broadcast `room.disconnected` with `reason: 'manually_stopped'`.

### Server Shutdown

7. **`SIGTERM`/`SIGINT`**: `cleanupAllShards()` closes all WebSocket connections. No DB updates on shutdown.

### State Machine

```
                 startMonitor()
                        │
                        ▼
                 ┌─────────────┐
      ┌──────────│ not_started │
      │          └─────────────┘
      │                 │
  REST fails      REST succeeds
      │                 │
      ▼                 ▼
 ┌────────┐      ┌────────────┐
 │ failed │      │ monitoring │◄──── reconnect success
 └────────┘      └──────┬─────┘
      ▲                 │
      │            ┌────┴───────┬──────────────┐
  reconnect        │            │              │
  exhausted    game ends    WS drops      manual stop
      │            │            │              │
      │            ▼            ▼              ▼
      │      ┌───────────┐ REST check     ┌─────────┐
      │      │ completed │      │         │ stopped │
      │      └───────────┘      │         └─────────┘
      │                         │
      └────── room gone? ───────┘
                                │
                           room exists?
                                │
                           reconnect...
```

### Timeouts

| Concern | Value | Rationale |
|---------|-------|-----------|
| WebSocket connect timeout | 10s | Ecast servers respond fast |
| Reconnect backoff | 2s, 4s, 8s | Three attempts, ~14s total |
| Max reconnect attempts | 3 | Fail fast, user can retry manually |
| WebSocket inactivity timeout | None | Shard connections receive periodic `shard/sync` CRDT messages |

## Dependencies

**Added:** `ws` (Node.js WebSocket library), which is already a dependency (used by `websocket-manager.js`), so nothing new is installed.

**Removed:** `puppeteer`, no longer needed for room monitoring.

## Non-Goals

- Renaming REST endpoint paths (`start-player-check` / `stop-player-check`): kept for backwards compatibility
- Auto-starting monitoring when room code is set via `PATCH .../room-code`: kept as manual trigger only
- Frontend `Picker.jsx` changes (tracked separately; existing bugs: `message.event` vs `message.type`, subscribe without auth, `'waiting'` status that's never set)
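
For reference, the reconnect policy from the Error Handling and Reconnection section (three attempts with 2s, 4s, 8s backoff) could be sketched roughly as follows. The names `backoffDelays` and `retryWithBackoff`, and the injectable `sleep` option, are hypothetical and exist only to illustrate the schedule. This is not the module's actual retry code.

```javascript
// Delays before each reconnect attempt: baseMs, 2 * baseMs, 4 * baseMs, ...
function backoffDelays(maxAttempts = 3, baseMs = 2000) {
  return Array.from({ length: maxAttempts }, (_, i) => baseMs * 2 ** i);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call after a connection attempt has already failed. Waits, retries `connect`,
// and rethrows the last error once every attempt is exhausted.
async function retryWithBackoff(
  connect,
  { maxAttempts = 3, baseMs = 2000, sleep = defaultSleep } = {}
) {
  let lastError = new Error('no reconnect attempts were made');
  for (const delayMs of backoffDelays(maxAttempts, baseMs)) {
    await sleep(delayMs);
    try {
      return await connect();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

console.log(backoffDelays()); // [ 2000, 4000, 8000 ]
```

With the defaults, the waits add up to 14 seconds, matching the "~14s total" figure in the Timeouts table. On exhaustion the caller would set status `'failed'` and broadcast `room.disconnected` with `reason: 'connection_failed'`.
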