Add design doc for ecast shard monitor replacing Puppeteer audience approach
Replaces room-monitor.js (REST polling) and player-count-checker.js (Puppeteer/CDP audience join) with a single EcastShardClient that connects as a shard via direct WebSocket. Defines the new event contract, integration points, error handling, and reconnection strategy.

Made-with: Cursor
345
docs/plans/2026-03-20-shard-monitor-design.md
Normal file
@@ -0,0 +1,345 @@
# Ecast Shard Monitor: Design Document

**Date:** 2026-03-20
**Status:** Approved
**Replaces:** `room-monitor.js` (REST polling for lock) + `player-count-checker.js` (Puppeteer audience join)

## Problem

The current player count approach launches a headless Chrome instance via Puppeteer, navigates to `jackbox.tv`, joins as an audience member through the UI, and sniffs WebSocket frames via CDP. This is fragile, resource-heavy, and occupies an audience slot. The room monitor is a separate module that polls the REST API until the room locks, then hands off to the Puppeteer checker. The result is two modules, two connection strategies, and a circular dependency workaround.

## Solution

Replace both modules with a single `EcastShardClient` that connects to the Jackbox ecast server as a **shard** via a direct Node.js WebSocket. The shard role:

- Gets the full `here` map (authoritative player list with names and roles)
- Receives real-time entity updates (room state, player joins, game end)
- Can query entities via `object/get`
- Does NOT count toward `maxPlayers` or trigger `full: true`
- Does NOT require a browser

One REST call upfront validates the room and retrieves the `host` field needed for the WebSocket URL. After that, the shard connection handles everything.
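
As a minimal sketch of how the `host` field could feed the shard connection, the helper below assembles the WebSocket URL in the shape shown in the lifecycle diagram. The function name `buildShardUrl` and the sample host are hypothetical; only the path and query parameters come from this document.

```javascript
// Sketch only: `buildShardUrl` and the sample host are illustrative assumptions.
// The path and query parameters mirror the URL in the lifecycle diagram.
function buildShardUrl({ host, roomCode, sessionId }) {
  const url = new URL(`wss://${host}/api/v2/rooms/${roomCode}/play`);
  url.search = new URLSearchParams({
    role: 'shard',
    name: 'GamePicker',
    userId: `gamepicker-${sessionId}`,
    format: 'json',
  }).toString();
  return url.toString();
}

// `host` would come from the REST room lookup; this value is made up.
const shardUrl = buildShardUrl({
  host: 'ecast.example.test',
  roomCode: 'LSBN',
  sessionId: 1,
});
console.log(shardUrl);
```

Building the query with `URLSearchParams` keeps any escaping of the room code or session id in one place instead of relying on string concatenation.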

## Architecture

### Lifecycle

```
Room code registered
        │
        ▼
REST: GET /rooms/{code} ──── 404 ──→ Mark failed, stop
        │
        │ (get host, maxPlayers, locked, appTag)
        ▼
WSS: Connect as shard
  wss://{host}/api/v2/rooms/{code}/play?role=shard&name=GamePicker&userId=gamepicker-{sessionId}&format=json
        │
        ▼
client/welcome received
  ├── Parse `here` → initial player count (filter for `player` roles)
  ├── Parse `entities.room` → lobby state, gameCanStart, etc.
  ├── Store `secret` + `id` for reconnection
  └── Broadcast initial state to our clients
        │
        ▼
┌─── Event loop (listening for server messages) ─────┐
│                                                    │
│  `object` (key: textDescriptions)                  │
│    → Parse latestDescriptions for player joins     │
│    → Broadcast `lobby.player-joined` to clients    │
│                                                    │
│  `object` (key: room)                              │
│    → Detect state transitions:                     │
│      lobbyState changes → broadcast lobby updates  │
│      state: "Gameplay" → broadcast `game.started`  │
│      gameFinished: true → broadcast `game.ended`   │
│      gameResults → extract final player count      │
│                                                    │
│  `client/connected` (if delivered to shards)       │
│    → Update here map, recount players              │
│                                                    │
│  WebSocket close/error                             │
│    → REST check: room exists?                      │
│      Yes → reconnect with secret/id                │
│      No → game ended, finalize                     │
└────────────────────────────────────────────────────┘
```
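
The `room` entity branch of the event loop can be sketched as a pure comparison between the previous and the latest room snapshot. This is an illustration under assumptions: the function name `eventsForRoomUpdate` and the idea of returning a list of event types are hypothetical, while the field names (`lobbyState`, `state`, `gameFinished`) are taken from the diagram above.

```javascript
// Sketch only: decides which client events a `room` entity update should
// trigger. `eventsForRoomUpdate` is a hypothetical helper, not existing code.
function eventsForRoomUpdate(prev, next) {
  const events = [];
  if (next.lobbyState !== prev.lobbyState) {
    events.push('lobby.updated');
  }
  // Fire only on the transition into Gameplay, not on every later update.
  if (prev.state !== 'Gameplay' && next.state === 'Gameplay') {
    events.push('game.started');
  }
  // Fire only the first time gameFinished flips to true.
  if (!prev.gameFinished && next.gameFinished === true) {
    events.push('game.ended');
  }
  return events;
}

const before = { state: 'Lobby', lobbyState: 'CanStart', gameFinished: false };
const after = { state: 'Lobby', lobbyState: 'Countdown', gameFinished: false };
console.log(eventsForRoomUpdate(before, after));
```

Comparing against the previous snapshot keeps repeated entity updates from re-broadcasting `game.started` or `game.ended`.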

### Internal State

| Field | Type | Source |
|-------|------|--------|
| `playerCount` | number | `here` map filtered for `player` roles |
| `playerNames` | string[] | `here` map player role `name` fields |
| `lobbyState` | string | `room` entity `lobbyState` |
| `gameState` | string | `room` entity `state` (`"Lobby"`, `"Gameplay"`) |
| `gameStarted` | boolean | Derived from `state === "Gameplay"` |
| `gameFinished` | boolean | `room` entity `gameFinished` |
| `maxPlayers` | number | REST response + `room` entity |
| `secret` / `id` | string/number | `client/welcome` for reconnection |

### Player Counting

The `here` map from `client/welcome` is the authoritative source. It lists all registered connections with their roles. Count entries where `roles` contains `player`. The shard itself is excluded (it has `roles: {shard: {}}`). The host (ID 1, `roles: {host: {}}`) is also excluded. Since Jackbox holds slots for disconnected players, `here` always reflects the true occupied slot count.

For subsequent joins after connect, `textDescriptions` entity updates provide join notifications. Since shards have `here` visibility, `client/connected` messages may also be delivered; both paths are handled, with `here` as source of truth.
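
The counting rule above can be sketched as a small filter over the `here` map. The exact entry shape is an assumption: this sketch supposes each entry carries a `roles` object and that the display name lives at `roles.player.name`, as the Internal State table suggests. The helper name `summarizeHere` is hypothetical.

```javascript
// Sketch only: assumes `here` maps connection ids to { roles: {...} } and that
// player names are stored at roles.player.name (an assumption, not confirmed).
function summarizeHere(here) {
  const playerNames = Object.values(here || {})
    .filter((entry) => entry && entry.roles && entry.roles.player)
    .map((entry) => entry.roles.player.name);
  return { playerCount: playerNames.length, playerNames };
}

// Sample data shaped after the roles described above; values are made up.
const sampleHere = {
  1: { roles: { host: {} } },
  2: { roles: { player: { name: 'Alice' } } },
  3: { roles: { player: { name: 'Bob' } } },
  4: { roles: { shard: {} } },
};
console.log(summarizeHere(sampleHere));
```

Because the filter keys on the presence of a `player` role, the host and the shard drop out without special-casing their ids.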

## WebSocket Events (Game Picker → Connected Clients)

### `room.connected`

Shard successfully connected to the Jackbox room. Sent once on initial connect. Replaces the old `audience.joined` event.

```json
{
  "type": "room.connected",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "appTag": "drawful2international",
    "maxPlayers": 8,
    "playerCount": 2,
    "players": ["Alice", "Bob"],
    "lobbyState": "CanStart",
    "gameState": "Lobby"
  }
}
```
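
One way to assemble this payload from the monitor's internal state is sketched below. The helper name `buildRoomConnectedEvent`, the split between `context` and `state`, and the ISO 8601 timestamp are assumptions; the document does not specify the timestamp format.

```javascript
// Sketch only: `buildRoomConnectedEvent` is hypothetical, and the ISO 8601
// timestamp is an assumed format (the document leaves it unspecified).
function buildRoomConnectedEvent(context, state, now = new Date()) {
  return {
    type: 'room.connected',
    timestamp: now.toISOString(),
    data: {
      sessionId: context.sessionId,
      gameId: context.gameId,
      roomCode: context.roomCode,
      appTag: context.appTag,
      maxPlayers: state.maxPlayers,
      playerCount: state.playerNames.length,
      players: [...state.playerNames], // copy so later joins do not mutate the event
      lobbyState: state.lobbyState,
      gameState: state.gameState,
    },
  };
}

const connectedEvent = buildRoomConnectedEvent(
  { sessionId: 1, gameId: 5, roomCode: 'LSBN', appTag: 'drawful2international' },
  { maxPlayers: 8, playerNames: ['Alice', 'Bob'], lobbyState: 'CanStart', gameState: 'Lobby' }
);
console.log(JSON.stringify(connectedEvent, null, 2));
```

Deriving `playerCount` from the names array keeps the two fields from drifting apart.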

### `lobby.player-joined`

A new player joined the lobby.

```json
{
  "type": "lobby.player-joined",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "playerName": "Charlie",
    "playerCount": 3,
    "players": ["Alice", "Bob", "Charlie"],
    "maxPlayers": 8
  }
}
```

### `lobby.updated`

Lobby state changed (enough players to start, countdown started, etc.).

```json
{
  "type": "lobby.updated",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "lobbyState": "Countdown",
    "gameCanStart": true,
    "gameIsStarting": true,
    "playerCount": 4
  }
}
```

### `game.started`

The game transitioned from Lobby to Gameplay.

```json
{
  "type": "game.started",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "playerCount": 4,
    "players": ["Alice", "Bob", "Charlie", "Diana"],
    "maxPlayers": 8
  }
}
```

### `game.ended`

The game finished.

```json
{
  "type": "game.ended",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "playerCount": 4,
    "players": ["Alice", "Bob", "Charlie", "Diana"]
  }
}
```

### `room.disconnected`

Shard lost connection to the Jackbox room.

```json
{
  "type": "room.disconnected",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "reason": "room_closed",
    "finalPlayerCount": 4
  }
}
```

Possible `reason` values: `room_closed`, `room_not_found`, `connection_failed`, `role_rejected`, `manually_stopped`.
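
Since `reason` is a closed set, a small guard can catch typos before a payload is broadcast. The sketch below is illustrative: the helper name `buildRoomDisconnectedData` and the choice to throw a `RangeError` are assumptions, while the reason strings are the ones listed above.

```javascript
// Sketch only: the helper and its error handling are hypothetical. The reason
// strings are copied from the list in this document.
const DISCONNECT_REASONS = new Set([
  'room_closed',
  'room_not_found',
  'connection_failed',
  'role_rejected',
  'manually_stopped',
]);

function buildRoomDisconnectedData(context, reason, finalPlayerCount) {
  if (!DISCONNECT_REASONS.has(reason)) {
    throw new RangeError(`Unknown disconnect reason: ${reason}`);
  }
  return {
    sessionId: context.sessionId,
    gameId: context.gameId,
    roomCode: context.roomCode,
    reason,
    finalPlayerCount,
  };
}

console.log(
  buildRoomDisconnectedData({ sessionId: 1, gameId: 5, roomCode: 'LSBN' }, 'room_closed', 4)
);
```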

### Dropped Events

| Old event | Replacement |
|-----------|-------------|
| `audience.joined` | `room.connected` (richer payload) |
| `player-count.updated` (automated) | `lobby.player-joined`, `game.started`, `game.ended` carry `playerCount` |

The manual `PATCH .../player-count` endpoint keeps broadcasting `player-count.updated` for its specific use case.

### DB Persistence

The `session_games` table columns `player_count` and `player_count_check_status` continue to be updated:

- `player_count`: updated on each join and at game end
- `player_count_check_status`: `'monitoring'` (shard connected), `'completed'` (game ended with count), `'failed'` (couldn't connect), `'stopped'` (manual stop)

The old `'checking'` status becomes `'monitoring'`.
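
The status values above can be captured in a lookup, with a normalizer for rows that still hold the old value. This is a sketch under assumptions: the outcome keys (`shard_connected`, `game_ended`, and so on) and both helper names are hypothetical, and whether existing rows need normalizing at read time is not stated in this document.

```javascript
// Sketch only: outcome keys and helper names are illustrative assumptions.
// The status strings are the ones listed for player_count_check_status.
const STATUS_BY_OUTCOME = Object.freeze({
  shard_connected: 'monitoring',
  game_ended: 'completed',
  connect_failed: 'failed',
  manual_stop: 'stopped',
});

// Maps the legacy 'checking' value onto its replacement; other values pass through.
function normalizeCheckStatus(status) {
  return status === 'checking' ? 'monitoring' : status;
}

console.log(STATUS_BY_OUTCOME.game_ended, normalizeCheckStatus('checking'));
```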

## Integration Points

### Files Deleted

- `backend/utils/player-count-checker.js`: Puppeteer audience approach
- `backend/utils/room-monitor.js`: REST polling for lock state

### Files Created

- `backend/utils/ecast-shard-client.js`: `EcastShardClient` class + module exports: `startMonitor`, `stopMonitor`, `cleanupAllShards`

### Files Modified

**`backend/utils/jackbox-api.js`**: Add `getRoomInfo(roomCode)`, which returns the full room response including `host`, `appTag`, `audienceEnabled`.

**`backend/routes/sessions.js`**: Replace imports:

```javascript
// Old
const { stopPlayerCountCheck } = require('../utils/player-count-checker');
const { startRoomMonitor, stopRoomMonitor } = require('../utils/room-monitor');

// New
const { startMonitor, stopMonitor } = require('../utils/ecast-shard-client');
```

Stop call sites collapse from two function calls to one; start call sites swap `startRoomMonitor` for `startMonitor`:

| Route | Old | New |
|-------|-----|-----|
| `POST /:id/games` (with room_code) | `startRoomMonitor(...)` | `startMonitor(...)` |
| `PATCH .../status` (away from playing) | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` |
| `DELETE .../games/:gameId` | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` |
| `POST .../start-player-check` | `startRoomMonitor(...)` | `startMonitor(...)` |
| `POST .../stop-player-check` | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` |

Endpoint paths stay the same for backwards compatibility.

**`backend/server.js`**: Wire `cleanupAllShards()` into `SIGTERM`/`SIGINT` handlers.
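
A minimal sketch of that wiring follows. The helper `registerShutdownHandlers` is hypothetical, and calling `exit(0)` after cleanup is an assumption about how `server.js` shuts down; the real handlers may do more. The process object is injectable so the sketch can be exercised without terminating anything.

```javascript
const { EventEmitter } = require('node:events');

// Sketch only: `registerShutdownHandlers` is a hypothetical helper. Exiting
// with code 0 after cleanup is an assumption, not taken from server.js.
function registerShutdownHandlers(cleanupAllShards, proc = process) {
  for (const signal of ['SIGTERM', 'SIGINT']) {
    proc.once(signal, () => {
      cleanupAllShards(); // closes shard WebSockets; no DB writes on shutdown
      proc.exit(0);
    });
  }
}

// Demo against a fake process object so the example does not actually exit.
const fakeProcess = Object.assign(new EventEmitter(), {
  exitCode: null,
  exit(code) {
    this.exitCode = code;
  },
});
let cleanupCalls = 0;
registerShutdownHandlers(() => {
  cleanupCalls += 1;
}, fakeProcess);
fakeProcess.emit('SIGTERM');
console.log(cleanupCalls, fakeProcess.exitCode);
```

In `server.js` the call would omit the second argument so the handlers attach to the real `process`.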

## Error Handling and Reconnection

### Connection Failures

1. **REST validation fails** (room not found, network error): Set status `'failed'`, broadcast `room.disconnected` with `reason: 'room_not_found'` or `'connection_failed'`. No automatic retry.

2. **Shard WebSocket fails to connect**: Retry up to 3 times with exponential backoff (2s, 4s, 8s). On exhaustion, set status `'failed'`, broadcast `room.disconnected` with `reason: 'connection_failed'`.

3. **Ecast rejects the shard role** (error opcode received): Set status `'failed'`, broadcast `room.disconnected` with `reason: 'role_rejected'`. No retry.

### Mid-Session Disconnections

4. **WebSocket closes unexpectedly**: REST check `GET /rooms/{code}`:
   - Room exists → reconnect with stored `secret`/`id` (up to 3 attempts, exponential backoff). Transparent to clients on success.
   - Room gone → finalize with last known count, status `'completed'`, broadcast `game.ended` + `room.disconnected`.

5. **Ecast error 2027 "room already closed"**: Same as room-gone path.
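
The two mid-session cases above reduce to one decision, sketched here as a pure function. The name `nextStepAfterDrop`, its argument shape, and the returned `action` labels are hypothetical. The `'failed'` outcome on exhausted reconnects follows the state machine in this document.

```javascript
// Sketch only: function name, argument shape, and action labels are assumptions.
const ROOM_ALREADY_CLOSED = 2027; // ecast error code cited in item 5

function nextStepAfterDrop({
  roomExists,
  ecastErrorCode = null,
  attemptsUsed = 0,
  maxAttempts = 3,
}) {
  // Room gone (or ecast says it is closed): finalize with the last known count.
  if (ecastErrorCode === ROOM_ALREADY_CLOSED || !roomExists) {
    return {
      action: 'finalize',
      status: 'completed',
      events: ['game.ended', 'room.disconnected'],
    };
  }
  // Room still exists: reconnect with the stored secret/id while attempts remain.
  if (attemptsUsed < maxAttempts) {
    return { action: 'reconnect', status: 'monitoring', events: [] };
  }
  // Attempts exhausted: give up and report the failure.
  return { action: 'give_up', status: 'failed', events: ['room.disconnected'] };
}

console.log(nextStepAfterDrop({ roomExists: true, attemptsUsed: 1 }));
console.log(nextStepAfterDrop({ roomExists: false }));
```

Keeping the decision free of I/O means the REST check and the reconnect attempt can be tested separately from the branching logic.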

### Manual Stop

6. **`stop-player-check` called or game status changes**: Close WebSocket gracefully, set status `'stopped'` (unless already `'completed'`), broadcast `room.disconnected` with `reason: 'manually_stopped'`.

### Server Shutdown

7. **`SIGTERM`/`SIGINT`**: `cleanupAllShards()` closes all WebSocket connections. No DB updates on shutdown.

### State Machine

```
                 startMonitor()
                        │
                        ▼
                 ┌─────────────┐
    ┌────────────│ not_started │
    │            └─────────────┘
    │                   │
REST fails        REST succeeds
    │                   │
    ▼                   ▼
┌────────┐        ┌────────────┐
│ failed │        │ monitoring │◄──── reconnect success
└────────┘        └─────┬──────┘
    ▲                   │
    │           ┌───────┴───────┬───────────────┐
reconnect       │               │               │
exhausted   game ends       WS drops       manual stop
    │           │               │               │
    │           ▼               ▼               ▼
    │     ┌───────────┐    REST check      ┌─────────┐
    │     │ completed │         │          │ stopped │
    │     └───────────┘         │          └─────────┘
    │                           │
    └─────── room gone? ────────┘
                                │
                          room exists?
                                │
                          reconnect...
```
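
The status transitions can also be written as a lookup table, which makes rules such as "do not overwrite `'completed'` with `'stopped'`" easy to enforce. This sketch follows the numbered cases in the prose above. Treating `completed`, `stopped`, and `failed` as terminal for a single monitor instance is an assumption, and `canTransition` is a hypothetical helper.

```javascript
// Sketch only: terminal states are an assumption for one monitor instance;
// a manual retry would start a fresh monitor rather than leave 'failed'.
const TRANSITIONS = Object.freeze({
  not_started: ['monitoring', 'failed'],
  monitoring: ['completed', 'stopped', 'failed'],
  completed: [],
  stopped: [],
  failed: [],
});

function canTransition(from, to) {
  return (TRANSITIONS[from] || []).includes(to);
}

console.log(canTransition('monitoring', 'completed'));
console.log(canTransition('completed', 'stopped'));
```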

### Timeouts

| Concern | Value | Rationale |
|---------|-------|-----------|
| WebSocket connect timeout | 10s | Ecast servers respond fast |
| Reconnect backoff | 2s, 4s, 8s | Three attempts, ~14s total |
| Max reconnect attempts | 3 | Fail fast, user can retry manually |
| WebSocket inactivity timeout | None | Shard connections receive periodic `shard/sync` CRDT messages |
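
The backoff row can be checked with a few lines of arithmetic, and the same schedule can drive a retry loop. In this sketch the names `backoffDelaysMs` and `connectWithRetry` are hypothetical, `connect` stands in for whatever opens the shard WebSocket, and making one immediate attempt before the first delay is an assumption about how "retry up to 3 times" is counted.

```javascript
// Sketch only: helper names are illustrative. Delays double from a 2s base.
function backoffDelaysMs(maxAttempts = 3, baseMs = 2000) {
  return Array.from({ length: maxAttempts }, (_, i) => baseMs * 2 ** i);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Assumption: one immediate attempt, then one retry after each delay.
// `connect` is a caller-supplied async function that opens the connection.
async function connectWithRetry(
  connect,
  { delays = backoffDelaysMs(), sleep = defaultSleep } = {}
) {
  let lastError;
  for (let attempt = 0; attempt <= delays.length; attempt += 1) {
    if (attempt > 0) await sleep(delays[attempt - 1]);
    try {
      return await connect();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

const delays = backoffDelaysMs();
const totalMs = delays.reduce((sum, ms) => sum + ms, 0);
console.log(delays, totalMs);
```

The injectable `sleep` lets tests run the retry loop without waiting for real timers.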

## Dependencies

**Added:** none. `ws` (Node.js WebSocket library) is already a dependency (used by `websocket-manager.js`).

**Removed:** `puppeteer`, which is no longer needed for room monitoring.

## Non-Goals

- Renaming REST endpoint paths (`start-player-check` / `stop-player-check`): kept for backwards compatibility
- Auto-starting monitoring when room code is set via `PATCH .../room-code`: kept as manual trigger only
- Frontend `Picker.jsx` changes: tracked separately (existing bugs: `message.event` vs `message.type`, subscribe without auth, `'waiting'` status that's never set)