Add design doc for ecast shard monitor replacing Puppeteer audience approach

Replaces room-monitor.js (REST polling) and player-count-checker.js
(Puppeteer/CDP audience join) with a single EcastShardClient that
connects as a shard via direct WebSocket. Defines new event contract,
integration points, error handling, and reconnection strategy.

Made-with: Cursor
cottongin
2026-03-20 10:42:33 -04:00
parent e6198181f8
commit 002e1d70a6


@@ -0,0 +1,345 @@
# Ecast Shard Monitor — Design Document
**Date:** 2026-03-20
**Status:** Approved
**Replaces:** `room-monitor.js` (REST polling for lock) + `player-count-checker.js` (Puppeteer audience join)
## Problem
The current player-count approach launches a headless Chrome instance via Puppeteer, navigates to `jackbox.tv`, joins as an audience member through the UI, and sniffs WebSocket frames via CDP. This is fragile, resource-heavy, and occupies an audience slot. The room monitor is a separate module that polls the REST API until the room locks, then hands off to the Puppeteer checker. The result is two modules, two connection strategies, and a circular-dependency workaround.
## Solution
Replace both modules with a single `EcastShardClient` that connects to the Jackbox ecast server as a **shard** via a direct Node.js WebSocket. The shard role:
- Gets the full `here` map (authoritative player list with names and roles)
- Receives real-time entity updates (room state, player joins, game end)
- Can query entities via `object/get`
- Does NOT count toward `maxPlayers` or trigger `full: true`
- Does NOT require a browser
One REST call upfront validates the room and retrieves the `host` field needed for the WebSocket URL. After that, the shard connection handles everything.
## Architecture
### Lifecycle
```
Room code registered
REST: GET /rooms/{code} ──── 404 ──→ Mark failed, stop
  │  (get host, maxPlayers, locked, appTag)
WSS: Connect as shard
  wss://{host}/api/v2/rooms/{code}/play?role=shard&name=GamePicker&userId=gamepicker-{sessionId}&format=json
client/welcome received
  ├── Parse `here` → initial player count (filter for `player` roles)
  ├── Parse `entities.room` → lobby state, gameCanStart, etc.
  ├── Store `secret` + `id` for reconnection
  └── Broadcast initial state to our clients
┌─── Event loop (listening for server messages) ─────┐
│                                                    │
│  `object` (key: textDescriptions)                  │
│    → Parse latestDescriptions for player joins     │
│    → Broadcast `lobby.player-joined` to clients    │
│                                                    │
│  `object` (key: room)                              │
│    → Detect state transitions:                     │
│      lobbyState changes → broadcast lobby updates  │
│      state: "Gameplay" → broadcast `game.started`  │
│      gameFinished: true → broadcast `game.ended`   │
│      gameResults → extract final player count      │
│                                                    │
│  `client/connected` (if delivered to shards)       │
│    → Update here map, recount players              │
│                                                    │
│  WebSocket close/error                             │
│    → REST check: room exists?                      │
│      Yes → reconnect with secret/id                │
│      No → game ended, finalize                     │
└────────────────────────────────────────────────────┘
```
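As a rough sketch of the connect step in the lifecycle above, the shard URL could be assembled from the REST response as follows. `buildShardUrl` is a hypothetical helper name, and `ecast.example.test` in the usage note is a placeholder host; the path and query parameters mirror the URL shown in the diagram.

```javascript
// Hypothetical helper: builds the shard WebSocket URL from the lifecycle diagram.
// `host` is the value returned by the REST room lookup; `sessionId` is our own session ID.
function buildShardUrl({ host, roomCode, sessionId }) {
  const url = new URL(`wss://${host}/api/v2/rooms/${encodeURIComponent(roomCode)}/play`);
  url.searchParams.set('role', 'shard');
  url.searchParams.set('name', 'GamePicker');
  url.searchParams.set('userId', `gamepicker-${sessionId}`);
  url.searchParams.set('format', 'json');
  return url.toString();
}
```

For example, `buildShardUrl({ host: 'ecast.example.test', roomCode: 'LSBN', sessionId: 1 })` yields a URL ending in `/play?role=shard&name=GamePicker&userId=gamepicker-1&format=json`.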
### Internal State
| Field | Type | Source |
|-------|------|--------|
| `playerCount` | number | `here` map filtered for `player` roles |
| `playerNames` | string[] | `here` map player role `name` fields |
| `lobbyState` | string | `room` entity `lobbyState` |
| `gameState` | string | `room` entity `state` (`"Lobby"`, `"Gameplay"`) |
| `gameStarted` | boolean | Derived from `state === "Gameplay"` |
| `gameFinished` | boolean | `room` entity `gameFinished` |
| `maxPlayers` | number | REST response + `room` entity |
| `secret` / `id` | string/number | `client/welcome` for reconnection |
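
The derived fields in the table can be illustrated with a small diffing helper. This is a minimal sketch, assuming the `room` entity value is a flat object exposing `lobbyState`, `state`, and `gameFinished` as named above; `detectRoomEvents` is a hypothetical name, not part of the planned module API.

```javascript
// Hypothetical helper: compares the previous and next `room` entity values
// and returns the names of the events that should be broadcast.
function detectRoomEvents(prev, next) {
  const events = [];
  if (next.lobbyState !== prev.lobbyState) {
    events.push('lobby.updated');
  }
  // Only fire on the transition into Gameplay, not on every Gameplay update.
  if (prev.state !== 'Gameplay' && next.state === 'Gameplay') {
    events.push('game.started');
  }
  if (!prev.gameFinished && next.gameFinished === true) {
    events.push('game.ended');
  }
  return events;
}
```

Comparing against the previous value keeps repeated entity updates from re-broadcasting `game.started` or `game.ended`.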
### Player Counting
The `here` map from `client/welcome` is the authoritative source. It lists all registered connections with their roles. Count entries where `roles` contains `player`. The shard itself is excluded (it has `roles: {shard: {}}`). The host (ID 1, `roles: {host: {}}`) is also excluded. Since Jackbox holds slots for disconnected players, `here` always reflects the true occupied slot count.
For subsequent joins after connect, `textDescriptions` entity updates provide join notifications. Since shards have `here` visibility, `client/connected` messages may also be delivered — both paths are handled, with `here` as source of truth.
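The counting rule above could look roughly like this. The sketch assumes each `here` entry has the shape `{ roles: { player: { name } } }` for players, with host and shard entries carrying `host` or `shard` keys instead; `summarizePlayers` is a hypothetical name.

```javascript
// Counts `here` entries whose `roles` object has a `player` key.
// Host ({ host: {} }) and shard ({ shard: {} }) entries are skipped.
// Assumption: the player's display name lives at roles.player.name.
function summarizePlayers(here) {
  const players = Object.values(here ?? {}).filter((entry) => entry?.roles?.player);
  return {
    playerCount: players.length,
    playerNames: players.map((entry) => entry.roles.player.name),
  };
}
```

Because the result is recomputed from the whole map each time, it can be called both on `client/welcome` and after any later update to `here`.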
## WebSocket Events (Game Picker → Connected Clients)
### `room.connected`
Shard successfully connected to the Jackbox room. Sent once on initial connect. Replaces the old `audience.joined` event.
```json
{
"type": "room.connected",
"timestamp": "...",
"data": {
"sessionId": 1,
"gameId": 5,
"roomCode": "LSBN",
"appTag": "drawful2international",
"maxPlayers": 8,
"playerCount": 2,
"players": ["Alice", "Bob"],
"lobbyState": "CanStart",
"gameState": "Lobby"
}
}
```
### `lobby.player-joined`
A new player joined the lobby.
```json
{
"type": "lobby.player-joined",
"timestamp": "...",
"data": {
"sessionId": 1,
"gameId": 5,
"roomCode": "LSBN",
"playerName": "Charlie",
"playerCount": 3,
"players": ["Alice", "Bob", "Charlie"],
"maxPlayers": 8
}
}
```
### `lobby.updated`
Lobby state changed (enough players to start, countdown started, etc.).
```json
{
"type": "lobby.updated",
"timestamp": "...",
"data": {
"sessionId": 1,
"gameId": 5,
"roomCode": "LSBN",
"lobbyState": "Countdown",
"gameCanStart": true,
"gameIsStarting": true,
"playerCount": 4
}
}
```
### `game.started`
The game transitioned from Lobby to Gameplay.
```json
{
"type": "game.started",
"timestamp": "...",
"data": {
"sessionId": 1,
"gameId": 5,
"roomCode": "LSBN",
"playerCount": 4,
"players": ["Alice", "Bob", "Charlie", "Diana"],
"maxPlayers": 8
}
}
```
### `game.ended`
The game finished.
```json
{
"type": "game.ended",
"timestamp": "...",
"data": {
"sessionId": 1,
"gameId": 5,
"roomCode": "LSBN",
"playerCount": 4,
"players": ["Alice", "Bob", "Charlie", "Diana"]
}
}
```
### `room.disconnected`
Shard lost connection to the Jackbox room.
```json
{
"type": "room.disconnected",
"timestamp": "...",
"data": {
"sessionId": 1,
"gameId": 5,
"roomCode": "LSBN",
"reason": "room_closed",
"finalPlayerCount": 4
}
}
```
Possible `reason` values: `room_closed`, `room_not_found`, `connection_failed`, `role_rejected`, `manually_stopped`.
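One way to keep `reason` restricted to the values listed above is to validate it when the event is built. This is only a sketch: `buildRoomDisconnected` is a hypothetical helper, and the caller passes `timestamp` in because its format is not specified in this document.

```javascript
// Reason values listed in this document for `room.disconnected`.
const DISCONNECT_REASONS = new Set([
  'room_closed',
  'room_not_found',
  'connection_failed',
  'role_rejected',
  'manually_stopped',
]);

// Hypothetical builder for the `room.disconnected` envelope shown above.
function buildRoomDisconnected({ timestamp, sessionId, gameId, roomCode, reason, finalPlayerCount }) {
  if (!DISCONNECT_REASONS.has(reason)) {
    throw new RangeError(`Unknown disconnect reason: ${reason}`);
  }
  return {
    type: 'room.disconnected',
    timestamp,
    data: { sessionId, gameId, roomCode, reason, finalPlayerCount },
  };
}
```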
### Dropped Events
| Old event | Replacement |
|-----------|-------------|
| `audience.joined` | `room.connected` (richer payload) |
| `player-count.updated` (automated) | `lobby.player-joined`, `game.started`, `game.ended` carry `playerCount` |
The manual `PATCH .../player-count` endpoint keeps broadcasting `player-count.updated` for its specific use case.
### DB Persistence
The `session_games` table columns `player_count` and `player_count_check_status` continue to be updated:
- `player_count`: updated on each join and at game end
- `player_count_check_status`: `'monitoring'` (shard connected), `'completed'` (game ended with count), `'failed'` (couldn't connect), `'stopped'` (manual stop)
The old `'checking'` status becomes `'monitoring'`.
## Integration Points
### Files Deleted
- `backend/utils/player-count-checker.js` — Puppeteer audience approach
- `backend/utils/room-monitor.js` — REST polling for lock state
### Files Created
- `backend/utils/ecast-shard-client.js`: `EcastShardClient` class plus module exports `startMonitor`, `stopMonitor`, `cleanupAllShards`
### Files Modified
**`backend/utils/jackbox-api.js`** — Add `getRoomInfo(roomCode)` returning the full room response including `host`, `appTag`, `audienceEnabled`.
**`backend/routes/sessions.js`** — Replace imports:
```javascript
// Old
const { stopPlayerCountCheck } = require('../utils/player-count-checker');
const { startRoomMonitor, stopRoomMonitor } = require('../utils/room-monitor');
// New
const { startMonitor, stopMonitor } = require('../utils/ecast-shard-client');
```
Each call site now makes a single call; the stop paths, which previously called two functions, collapse to one:
| Route | Old | New |
|-------|-----|-----|
| `POST /:id/games` (with room_code) | `startRoomMonitor(...)` | `startMonitor(...)` |
| `PATCH .../status` (away from playing) | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` |
| `DELETE .../games/:gameId` | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` |
| `POST .../start-player-check` | `startRoomMonitor(...)` | `startMonitor(...)` |
| `POST .../stop-player-check` | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` |
Endpoint paths stay the same for backwards compatibility.
**`backend/server.js`** — Wire `cleanupAllShards()` into `SIGTERM`/`SIGINT` handlers.
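The shutdown wiring might look something like the following. This is a sketch under assumptions: `registerShutdownHandlers` is a hypothetical helper, the `proc` parameter exists only to make the function testable, and exiting with code 0 after cleanup is an illustrative choice that the real `server.js` may handle differently.

```javascript
// Hypothetical wiring for server.js. `cleanupAllShards` is the export named in
// this document; `proc` defaults to the real process object.
function registerShutdownHandlers(cleanupAllShards, proc = process) {
  for (const signal of ['SIGTERM', 'SIGINT']) {
    proc.once(signal, () => {
      cleanupAllShards(); // close all shard WebSockets
      proc.exit(0); // assumption: exit once cleanup has been triggered
    });
  }
}
```

In `server.js` this would be called once at startup, for example `registerShutdownHandlers(cleanupAllShards)`.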
## Error Handling and Reconnection
### Connection Failures
1. **REST validation fails** (room not found, network error): Set status `'failed'`, broadcast `room.disconnected` with `reason: 'room_not_found'` or `'connection_failed'`. No automatic retry.
2. **Shard WebSocket fails to connect**: Retry up to 3 times with exponential backoff (2s, 4s, 8s). On exhaustion, set status `'failed'`, broadcast `room.disconnected` with `reason: 'connection_failed'`.
3. **Ecast rejects the shard role** (error opcode received): Set status `'failed'`, broadcast `room.disconnected` with `reason: 'role_rejected'`. No retry.
### Mid-Session Disconnections
4. **WebSocket closes unexpectedly**: REST check `GET /rooms/{code}`:
- Room exists → reconnect with stored `secret`/`id` (up to 3 attempts, exponential backoff). Transparent to clients on success.
- Room gone → finalize with last known count, status `'completed'`, broadcast `game.ended` + `room.disconnected`.
5. **Ecast error 2027 "room already closed"**: Same as room-gone path.
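The two mid-session cases above reduce to one decision, sketched below. `planAfterDisconnect` is a hypothetical helper; it assumes the REST check has already been performed and its result is passed in as `roomExists`.

```javascript
// Ecast error code cited in this document for "room already closed".
const ROOM_ALREADY_CLOSED = 2027;

// Hypothetical decision helper: returns what the monitor should do after an
// unexpected close or an Ecast error.
function planAfterDisconnect({ roomExists, errorCode = null, lastPlayerCount }) {
  const roomGone = errorCode === ROOM_ALREADY_CLOSED || !roomExists;
  if (!roomGone) {
    // Room still exists: reconnect with the stored secret/id.
    return { action: 'reconnect', maxAttempts: 3 };
  }
  // Room is gone: finalize with the last known count.
  return {
    action: 'finalize',
    status: 'completed',
    playerCount: lastPlayerCount,
    broadcast: ['game.ended', 'room.disconnected'],
  };
}
```

Keeping this as a pure function separates the policy from the WebSocket and REST plumbing, which makes it straightforward to unit test.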
### Manual Stop
6. **`stop-player-check` called or game status changes**: Close WebSocket gracefully, set status `'stopped'` (unless already `'completed'`), broadcast `room.disconnected` with `reason: 'manually_stopped'`.
### Server Shutdown
7. **`SIGTERM`/`SIGINT`**: `cleanupAllShards()` closes all WebSocket connections. No DB updates on shutdown.
### State Machine
```
             startMonitor()
             ┌─────────────┐
    ┌────────│ not_started │
    │        └─────────────┘
    │               │
REST fails    REST succeeds
    │               │
    ▼               ▼
┌────────┐   ┌────────────┐
│ failed │   │ monitoring │◄──── reconnect success
└────────┘   └──────┬─────┘
    ▲               │
    │         ┌─────┴──────┬────────────┐
reconnect     │            │            │
exhausted game ends    WS drops    manual stop
    │         │            │            │
    │         ▼            ▼            ▼
    │   ┌───────────┐ REST check   ┌─────────┐
    │   │ completed │      │       │ stopped │
    │   └───────────┘      │       └─────────┘
    │                      │
    └───── room gone? ─────┘
          room exists?
          reconnect...
```
### Timeouts
| Concern | Value | Rationale |
|---------|-------|-----------|
| WebSocket connect timeout | 10s | Ecast servers respond fast |
| Reconnect backoff | 2s, 4s, 8s | Three attempts, ~14s total |
| Max reconnect attempts | 3 | Fail fast, user can retry manually |
| WebSocket inactivity timeout | None | Shard connections receive periodic `shard/sync` CRDT messages |
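
The backoff schedule in the table works out to 2 s + 4 s + 8 s = 14 s of waiting across three attempts. A minimal sketch of that schedule follows; it assumes the delay is applied before each attempt, which is one reading of the table. `backoffDelayMs` and `reconnectWithBackoff` are hypothetical names, and `connect` stands in for whatever function opens the shard WebSocket.

```javascript
// Delay before attempt n (1-based): 2000, 4000, 8000 ms with the default base.
function backoffDelayMs(attempt, baseMs = 2000) {
  return baseMs * 2 ** (attempt - 1);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Waits, then calls `connect(attempt)`; gives up after `maxAttempts` failures
// and rethrows the last error so the caller can mark the monitor as failed.
async function reconnectWithBackoff(connect, { maxAttempts = 3, baseMs = 2000, sleep = defaultSleep } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    await sleep(backoffDelayMs(attempt, baseMs));
    try {
      return await connect(attempt);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

The `sleep` option is injectable so tests can run without real timers.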
## Dependencies
**Added:** None. The shard client uses `ws` (Node.js WebSocket library), which is already a dependency (used by `websocket-manager.js`).
**Removed:** `puppeteer` — no longer needed for room monitoring.
## Non-Goals
- Renaming REST endpoint paths (`start-player-check` / `stop-player-check`) — kept for backwards compatibility
- Auto-starting monitoring when room code is set via `PATCH .../room-code` — kept as manual trigger only
- Frontend `Picker.jsx` changes — tracked separately (existing bugs: `message.event` vs `message.type`, subscribe without auth, `'waiting'` status that's never set)