Add design doc for ecast shard monitor replacing Puppeteer audience approach
Replaces room-monitor.js (REST polling) and player-count-checker.js (Puppeteer/CDP audience join) with a single EcastShardClient that connects as a shard via direct WebSocket. Defines new event contract, integration points, error handling, and reconnection strategy. Made-with: Cursor
345
docs/plans/2026-03-20-shard-monitor-design.md
Normal file
@@ -0,0 +1,345 @@
# Ecast Shard Monitor — Design Document

**Date:** 2026-03-20
**Status:** Approved
**Replaces:** `room-monitor.js` (REST polling for lock) + `player-count-checker.js` (Puppeteer audience join)

## Problem

The current player count approach launches a headless Chrome instance via Puppeteer, navigates to `jackbox.tv`, joins as an audience member through the UI, and sniffs WebSocket frames via CDP. This is fragile, resource-heavy, and occupies an audience slot. The room monitor is a separate module that polls the REST API until the room locks, then hands off to the Puppeteer checker. The result is two modules, two connection strategies, and a circular-dependency workaround.

## Solution

Replace both modules with a single `EcastShardClient` that connects to the Jackbox ecast server as a **shard** via a direct Node.js WebSocket. The shard role:

- Gets the full `here` map (authoritative player list with names and roles)
- Receives real-time entity updates (room state, player joins, game end)
- Can query entities via `object/get`
- Does NOT count toward `maxPlayers` or trigger `full: true`
- Does NOT require a browser

One REST call upfront validates the room and retrieves the `host` field needed for the WebSocket URL. After that, the shard connection handles everything.

## Architecture

### Lifecycle

```
Room code registered
        │
        ▼
REST: GET /rooms/{code} ──── 404 ──→ Mark failed, stop
        │
        │  (get host, maxPlayers, locked, appTag)
        ▼
WSS: Connect as shard
    wss://{host}/api/v2/rooms/{code}/play?role=shard&name=GamePicker&userId=gamepicker-{sessionId}&format=json
        │
        ▼
client/welcome received
        ├── Parse `here` → initial player count (filter for `player` roles)
        ├── Parse `entities.room` → lobby state, gameCanStart, etc.
        ├── Store `secret` + `id` for reconnection
        └── Broadcast initial state to our clients
        │
        ▼
┌─── Event loop (listening for server messages) ─────┐
│                                                    │
│  `object` (key: textDescriptions)                  │
│    → Parse latestDescriptions for player joins     │
│    → Broadcast `lobby.player-joined` to clients    │
│                                                    │
│  `object` (key: room)                              │
│    → Detect state transitions:                     │
│      lobbyState changes → broadcast lobby updates  │
│      state: "Gameplay" → broadcast `game.started`  │
│      gameFinished: true → broadcast `game.ended`   │
│      gameResults → extract final player count      │
│                                                    │
│  `client/connected` (if delivered to shards)       │
│    → Update here map, recount players              │
│                                                    │
│  WebSocket close/error                             │
│    → REST check: room exists?                      │
│      Yes → reconnect with secret/id                │
│      No  → game ended, finalize                    │
└────────────────────────────────────────────────────┘
```
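
The shard URL in the lifecycle above can be assembled from the REST response. The following is a minimal sketch, not the actual implementation: `buildShardUrl` is a hypothetical helper name, and the host used in the usage note is a placeholder rather than a real ecast host.

```javascript
// Sketch only: buildShardUrl is a hypothetical helper, not an existing export.
// It follows the URL template shown in the lifecycle diagram.
function buildShardUrl(host, roomCode, sessionId) {
  const url = new URL(
    `wss://${host}/api/v2/rooms/${encodeURIComponent(roomCode)}/play`
  );
  url.searchParams.set('role', 'shard');
  url.searchParams.set('name', 'GamePicker');
  url.searchParams.set('userId', `gamepicker-${sessionId}`);
  url.searchParams.set('format', 'json');
  return url.toString();
}
```

For example, `buildShardUrl('ecast.example.test', 'LSBN', 1)` yields a `wss://` URL ending in `?role=shard&name=GamePicker&userId=gamepicker-1&format=json`, which could then be handed to the `ws` client constructor.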

### Internal State

| Field | Type | Source |
|-------|------|--------|
| `playerCount` | number | `here` map filtered for `player` roles |
| `playerNames` | string[] | `here` map player role `name` fields |
| `lobbyState` | string | `room` entity `lobbyState` |
| `gameState` | string | `room` entity `state` (`"Lobby"`, `"Gameplay"`) |
| `gameStarted` | boolean | Derived from `state === "Gameplay"` |
| `gameFinished` | boolean | `room` entity `gameFinished` |
| `maxPlayers` | number | REST response + `room` entity |
| `secret` / `id` | string/number | `client/welcome` for reconnection |
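
One way to turn `room` entity updates into the broadcasts named in the lifecycle is to diff the previous and the new entity value. The sketch below assumes the entity exposes `state`, `lobbyState`, and `gameFinished` as flat fields, as the table suggests; `deriveRoomEvents` is a hypothetical name.

```javascript
// Sketch only: compares two snapshots of the `room` entity and returns the
// names of the events that the transition should trigger.
function deriveRoomEvents(prev, next) {
  const events = [];
  if (next.lobbyState !== prev.lobbyState) {
    events.push('lobby.updated');
  }
  if (prev.state !== 'Gameplay' && next.state === 'Gameplay') {
    events.push('game.started');
  }
  if (!prev.gameFinished && next.gameFinished === true) {
    events.push('game.ended');
  }
  return events;
}
```

Comparing against the previous snapshot keeps each event from firing more than once when the server resends an unchanged entity.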

### Player Counting

The `here` map from `client/welcome` is the authoritative source. It lists all registered connections with their roles. Count entries where `roles` contains `player`. The shard itself is excluded (it has `roles: {shard: {}}`). The host (ID 1, `roles: {host: {}}`) is also excluded. Since Jackbox holds slots for disconnected players, `here` always reflects the true occupied slot count.

For subsequent joins after connect, `textDescriptions` entity updates provide join notifications. Since shards have `here` visibility, `client/connected` messages may also be delivered — both paths are handled, with `here` as source of truth.
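
The counting rule can be sketched as a small pure function. This assumes `here` is an object keyed by connection ID and that a player's display name lives at `roles.player.name`; both are inferred from the description above rather than from a captured payload, and `summarizePlayers` is a hypothetical name.

```javascript
// Sketch only: counts `here` entries whose roles include `player`.
// Host and shard entries have no `player` role, so they are skipped.
function summarizePlayers(here) {
  const playerNames = [];
  for (const entry of Object.values(here || {})) {
    const player = entry && entry.roles && entry.roles.player;
    if (player) {
      playerNames.push(player.name);
    }
  }
  return { playerCount: playerNames.length, playerNames };
}

// Example `here` map (shape assumed): host, two players, and our shard.
const exampleHere = {
  1: { roles: { host: {} } },
  2: { roles: { player: { name: 'Alice' } } },
  3: { roles: { player: { name: 'Bob' } } },
  4: { roles: { shard: {} } },
};
```

With `exampleHere`, the function reports two players, Alice and Bob.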

## WebSocket Events (Game Picker → Connected Clients)

### `room.connected`

Shard successfully connected to the Jackbox room. Sent once on initial connect. Replaces the old `audience.joined` event.

```json
{
  "type": "room.connected",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "appTag": "drawful2international",
    "maxPlayers": 8,
    "playerCount": 2,
    "players": ["Alice", "Bob"],
    "lobbyState": "CanStart",
    "gameState": "Lobby"
  }
}
```
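
The payload above uses a `type` / `timestamp` / `data` envelope. A minimal sketch of a builder for that envelope follows; it assumes the timestamp is an ISO 8601 string (the example leaves the format as `"..."`), and `buildEvent` is a hypothetical helper name.

```javascript
// Sketch only: wraps an event payload in the type/timestamp/data envelope.
// The ISO 8601 timestamp format is an assumption, not confirmed by the design.
function buildEvent(type, data, now = new Date()) {
  return { type, timestamp: now.toISOString(), data };
}

// Example: a room.connected event built from the monitor's internal state.
const exampleEvent = buildEvent('room.connected', {
  sessionId: 1,
  gameId: 5,
  roomCode: 'LSBN',
  appTag: 'drawful2international',
  maxPlayers: 8,
  playerCount: 2,
  players: ['Alice', 'Bob'],
  lobbyState: 'CanStart',
  gameState: 'Lobby',
});
```

Passing `now` as a parameter keeps the builder deterministic in tests.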

### `lobby.player-joined`

A new player joined the lobby.

```json
{
  "type": "lobby.player-joined",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "playerName": "Charlie",
    "playerCount": 3,
    "players": ["Alice", "Bob", "Charlie"],
    "maxPlayers": 8
  }
}
```

### `lobby.updated`

Lobby state changed (enough players to start, countdown started, etc.).

```json
{
  "type": "lobby.updated",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "lobbyState": "Countdown",
    "gameCanStart": true,
    "gameIsStarting": true,
    "playerCount": 4
  }
}
```

### `game.started`

The game transitioned from Lobby to Gameplay.

```json
{
  "type": "game.started",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "playerCount": 4,
    "players": ["Alice", "Bob", "Charlie", "Diana"],
    "maxPlayers": 8
  }
}
```

### `game.ended`

The game finished.

```json
{
  "type": "game.ended",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "playerCount": 4,
    "players": ["Alice", "Bob", "Charlie", "Diana"]
  }
}
```

### `room.disconnected`

Shard lost connection to the Jackbox room.

```json
{
  "type": "room.disconnected",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "reason": "room_closed",
    "finalPlayerCount": 4
  }
}
```

Possible `reason` values: `room_closed`, `room_not_found`, `connection_failed`, `role_rejected`, `manually_stopped`.

### Dropped Events

| Old event | Replacement |
|-----------|-------------|
| `audience.joined` | `room.connected` (richer payload) |
| `player-count.updated` (automated) | `lobby.player-joined`, `game.started`, `game.ended` carry `playerCount` |

The manual `PATCH .../player-count` endpoint keeps broadcasting `player-count.updated` for its specific use case.

### DB Persistence

The `session_games` table columns `player_count` and `player_count_check_status` continue to be updated:

- `player_count` — updated on each join and at game end
- `player_count_check_status` — `'monitoring'` (shard connected), `'completed'` (game ended with count), `'failed'` (couldn't connect), `'stopped'` (manual stop)

The old `'checking'` status becomes `'monitoring'`.

## Integration Points

### Files Deleted

- `backend/utils/player-count-checker.js` — Puppeteer audience approach
- `backend/utils/room-monitor.js` — REST polling for lock state

### Files Created

- `backend/utils/ecast-shard-client.js` — `EcastShardClient` class + module exports: `startMonitor`, `stopMonitor`, `cleanupAllShards`

### Files Modified

**`backend/utils/jackbox-api.js`** — Add `getRoomInfo(roomCode)` returning the full room response including `host`, `appTag`, `audienceEnabled`.

**`backend/routes/sessions.js`** — Replace imports:

```javascript
// Old
const { stopPlayerCountCheck } = require('../utils/player-count-checker');
const { startRoomMonitor, stopRoomMonitor } = require('../utils/room-monitor');

// New
const { startMonitor, stopMonitor } = require('../utils/ecast-shard-client');
```

All call sites change from two-function calls to one:

| Route | Old | New |
|-------|-----|-----|
| `POST /:id/games` (with room_code) | `startRoomMonitor(...)` | `startMonitor(...)` |
| `PATCH .../status` (away from playing) | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` |
| `DELETE .../games/:gameId` | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` |
| `POST .../start-player-check` | `startRoomMonitor(...)` | `startMonitor(...)` |
| `POST .../stop-player-check` | `stopRoomMonitor(...) + stopPlayerCountCheck(...)` | `stopMonitor(...)` |

Endpoint paths stay the same for backwards compatibility.

**`backend/server.js`** — Wire `cleanupAllShards()` into `SIGTERM`/`SIGINT` handlers.

## Error Handling and Reconnection

### Connection Failures

1. **REST validation fails** (room not found, network error): Set status `'failed'`, broadcast `room.disconnected` with `reason: 'room_not_found'` or `'connection_failed'`. No automatic retry.

2. **Shard WebSocket fails to connect**: Retry up to 3 times with exponential backoff (2s, 4s, 8s). On exhaustion, set status `'failed'`, broadcast `room.disconnected` with `reason: 'connection_failed'`.

3. **Ecast rejects the shard role** (error opcode received): Set status `'failed'`, broadcast `room.disconnected` with `reason: 'role_rejected'`. No retry.
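
The backoff schedule in item 2 can be sketched as follows. This reads "retry up to 3 times" as one initial attempt followed by up to three retries, with the 2s, 4s, and 8s delays applied before each retry; that reading is an assumption. `backoffDelays`, `connectWithRetry`, and the injectable `sleep` are hypothetical names, and `connect` stands in for whatever opens the shard WebSocket.

```javascript
// Sketch only: exponential backoff schedule, 2000/4000/8000 ms by default.
function backoffDelays(maxRetries = 3, baseMs = 2000) {
  return Array.from({ length: maxRetries }, (_, i) => baseMs * 2 ** i);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Tries `connect` once, then once more after each delay; rethrows the last
// error when every attempt has failed so the caller can mark the monitor
// as 'failed'.
async function connectWithRetry(
  connect,
  { delays = backoffDelays(), sleep = defaultSleep } = {}
) {
  let lastError;
  try {
    return await connect();
  } catch (err) {
    lastError = err;
  }
  for (const delayMs of delays) {
    await sleep(delayMs);
    try {
      return await connect();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Injecting `sleep` lets tests run the retry loop without real waits.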

### Mid-Session Disconnections

4. **WebSocket closes unexpectedly**: REST check `GET /rooms/{code}`:
   - Room exists → reconnect with stored `secret`/`id` (up to 3 attempts, exponential backoff). Transparent to clients on success.
   - Room gone → finalize with last known count, status `'completed'`, broadcast `game.ended` + `room.disconnected`.

5. **Ecast error 2027 "room already closed"**: Same as room-gone path.
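
The branch in items 4 and 5 can be expressed as a small decision function. This is a sketch under assumptions: `planAfterDisconnect` is a hypothetical name, and using `room_closed` as the `reason` on the room-gone path is a guess, since the design does not name the reason for that case.

```javascript
// Sketch only: decides what to do after the shard WebSocket drops.
// `roomExists` is the result of the REST check; `ecastErrorCode` is the
// error code received from ecast, if any.
function planAfterDisconnect({ roomExists, ecastErrorCode }) {
  const ROOM_ALREADY_CLOSED = 2027;
  if (ecastErrorCode === ROOM_ALREADY_CLOSED || !roomExists) {
    return {
      action: 'finalize',
      status: 'completed',
      broadcast: ['game.ended', 'room.disconnected'],
      reason: 'room_closed', // assumed reason value for the room-gone path
    };
  }
  return { action: 'reconnect', maxAttempts: 3 };
}
```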

### Manual Stop

6. **`stop-player-check` called or game status changes**: Close WebSocket gracefully, set status `'stopped'` (unless already `'completed'`), broadcast `room.disconnected` with `reason: 'manually_stopped'`.

### Server Shutdown

7. **`SIGTERM`/`SIGINT`**: `cleanupAllShards()` closes all WebSocket connections. No DB updates on shutdown.
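
A possible shape for `cleanupAllShards()` and its signal wiring is sketched below. The registry `Map`, its key format, `registerShard`, and `installShutdownHooks` are assumptions for illustration; only `cleanupAllShards` is named in the design. The sockets are expected to expose `close(code, reason)`, as `ws` WebSocket instances do.

```javascript
// Sketch only: module-level registry of open shard sockets.
// Key format (`${sessionId}-${gameId}`) is hypothetical.
const activeShards = new Map();

function registerShard(key, socket) {
  activeShards.set(key, socket);
}

// Closes every tracked socket and returns how many were closed.
// No DB writes happen here, matching the shutdown rule above.
function cleanupAllShards() {
  const closed = activeShards.size;
  for (const socket of activeShards.values()) {
    try {
      socket.close(1000, 'server shutdown');
    } catch (err) {
      // Ignore: the socket may already be closing or closed.
    }
  }
  activeShards.clear();
  return closed;
}

// Would be called once from server.js; left uninvoked in this sketch.
function installShutdownHooks(proc = process) {
  for (const signal of ['SIGTERM', 'SIGINT']) {
    proc.once(signal, () => {
      cleanupAllShards();
      proc.exit(0);
    });
  }
}
```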

### State Machine

```
                    startMonitor()
                          │
                          ▼
                   ┌─────────────┐
          ┌────────│ not_started │
          │        └─────────────┘
          │               │
     REST fails     REST succeeds
          │               │
          ▼               ▼
     ┌────────┐     ┌────────────┐
     │ failed │     │ monitoring │◄──── reconnect success
     └────────┘     └─────┬──────┘
          ▲               │
          │          ┌────┴────────┬──────────────┐
     reconnect       │             │              │
     exhausted   game ends     WS drops      manual stop
          │          │             │              │
          │          ▼             ▼              ▼
          │    ┌───────────┐  REST check     ┌─────────┐
          │    │ completed │       │         │ stopped │
          │    └───────────┘       │         └─────────┘
          │                        │
          └──── room gone? ────────┘
                                   │
                             room exists?
                                   │
                             reconnect...
```
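
The status transitions can also be captured as a lookup table. The sketch below follows the written error-handling rules, where a room that is gone finalizes as `'completed'`; the event names (`rest_failed`, `game_ended`, and so on) are hypothetical labels introduced here, while the status values come from the design.

```javascript
// Sketch only: status transition table. Terminal statuses have no entries,
// so any event leaves them unchanged.
const TRANSITIONS = {
  not_started: {
    rest_failed: 'failed',
    rest_succeeded: 'monitoring',
  },
  monitoring: {
    game_ended: 'completed',
    room_gone: 'completed',
    reconnect_succeeded: 'monitoring',
    reconnect_exhausted: 'failed',
    manual_stop: 'stopped',
  },
};

function nextStatus(status, event) {
  const row = TRANSITIONS[status];
  return (row && row[event]) || status;
}
```

Because `completed` has no outgoing transitions, a manual stop after the game has ended leaves the status as `'completed'`, which matches the rule in the Manual Stop section.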

### Timeouts

| Concern | Value | Rationale |
|---------|-------|-----------|
| WebSocket connect timeout | 10s | Ecast servers respond fast |
| Reconnect backoff | 2s, 4s, 8s | Three attempts, ~14s total |
| Max reconnect attempts | 3 | Fail fast, user can retry manually |
| WebSocket inactivity timeout | None | Shard connections receive periodic `shard/sync` CRDT messages |

## Dependencies

**Added:** None. `ws` (Node.js WebSocket library) is already a dependency (used by `websocket-manager.js`).

**Removed:** `puppeteer`, which is no longer needed for room monitoring.

## Non-Goals

- Renaming REST endpoint paths (`start-player-check` / `stop-player-check`) — kept for backwards compatibility
- Auto-starting monitoring when room code is set via `PATCH .../room-code` — kept as manual trigger only
- Frontend `Picker.jsx` changes — tracked separately (existing bugs: `message.event` vs `message.type`, subscribe without auth, `'waiting'` status that's never set)