jackboxpartypack-gamepicker/docs/plans/2026-03-20-shard-monitor-design.md
cottongin 002e1d70a6 Add design doc for ecast shard monitor replacing Puppeteer audience approach
Replaces room-monitor.js (REST polling) and player-count-checker.js
(Puppeteer/CDP audience join) with a single EcastShardClient that
connects as a shard via direct WebSocket. Defines new event contract,
integration points, error handling, and reconnection strategy.

Made-with: Cursor
2026-03-20 10:42:33 -04:00


Ecast Shard Monitor — Design Document

Date: 2026-03-20
Status: Approved
Replaces: room-monitor.js (REST polling for lock) + player-count-checker.js (Puppeteer audience join)

Problem

The current player count approach launches a headless Chrome instance via Puppeteer, navigates to jackbox.tv, joins as an audience member through the UI, and sniffs WebSocket frames via CDP. This is fragile, resource-heavy, and occupies an audience slot. The room monitor is a separate module that polls the REST API until the room locks, then hands off to the Puppeteer checker. The result is two modules, two connection strategies, and a circular dependency workaround.

Solution

Replace both modules with a single EcastShardClient that connects to the Jackbox ecast server as a shard via a direct Node.js WebSocket. The shard role:

  • Gets the full here map (authoritative player list with names and roles)
  • Receives real-time entity updates (room state, player joins, game end)
  • Can query entities via object/get
  • Does NOT count toward maxPlayers or trigger full: true
  • Does NOT require a browser

One REST call upfront validates the room and retrieves the host field needed for the WebSocket URL. After that, the shard connection handles everything.
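As a rough illustration of how the REST response feeds the WebSocket connection, the sketch below assembles the shard URL in the format shown in the Lifecycle diagram. The function name `buildShardUrl`, its parameter names, and the example host are hypothetical; only the path and query parameters come from this document.

```javascript
// Sketch only: `buildShardUrl` and its parameter names are hypothetical.
// The path and query parameters mirror the URL in the Lifecycle diagram.
function buildShardUrl({ host, roomCode, sessionId }) {
  const url = new URL(`wss://${host}/api/v2/rooms/${encodeURIComponent(roomCode)}/play`);
  url.searchParams.set('role', 'shard');
  url.searchParams.set('name', 'GamePicker');
  url.searchParams.set('userId', `gamepicker-${sessionId}`);
  url.searchParams.set('format', 'json');
  return url.toString();
}
```

The `host` value would come from the upfront REST call, so the URL cannot be built before that call succeeds.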

Architecture

Lifecycle

Room code registered
        │
        ▼
  REST: GET /rooms/{code}  ──── 404 ──→ Mark failed, stop
        │
        │ (get host, maxPlayers, locked, appTag)
        ▼
  WSS: Connect as shard
  wss://{host}/api/v2/rooms/{code}/play?role=shard&name=GamePicker&userId=gamepicker-{sessionId}&format=json
        │
        ▼
  client/welcome received
  ├── Parse `here` → initial player count (filter for `player` roles)
  ├── Parse `entities.room` → lobby state, gameCanStart, etc.
  ├── Store `secret` + `id` for reconnection
  └── Broadcast initial state to our clients
        │
        ▼
  ┌─── Event loop (listening for server messages) ───┐
  │                                                    │
  │  `object` (key: textDescriptions)                  │
  │    → Parse latestDescriptions for player joins     │
  │    → Broadcast `lobby.player-joined` to clients    │
  │                                                    │
  │  `object` (key: room)                              │
  │    → Detect state transitions:                     │
  │      lobbyState changes → broadcast lobby updates  │
  │      state: "Gameplay" → broadcast `game.started`  │
  │      gameFinished: true → broadcast `game.ended`   │
  │      gameResults → extract final player count      │
  │                                                    │
  │  `client/connected` (if delivered to shards)       │
  │    → Update here map, recount players              │
  │                                                    │
  │  WebSocket close/error                             │
  │    → REST check: room exists?                      │
  │      Yes → reconnect with secret/id                │
  │      No  → game ended, finalize                    │
  └────────────────────────────────────────────────────┘
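The `object` (key: room) branch of the event loop amounts to diffing the previous room entity against the new one. A minimal sketch of that comparison follows, assuming the entity value exposes `lobbyState`, `state`, and `gameFinished` as flat fields; the function name and the returned list of event names are hypothetical.

```javascript
// Sketch only: compares two snapshots of the `room` entity and lists the
// events to broadcast. `detectRoomTransitions` is a hypothetical name and the
// flat field layout is an assumption based on the Lifecycle diagram.
function detectRoomTransitions(prev, next) {
  const events = [];
  if (next.lobbyState !== prev.lobbyState) {
    events.push('lobby.updated');
  }
  if (prev.state !== 'Gameplay' && next.state === 'Gameplay') {
    events.push('game.started');
  }
  if (!prev.gameFinished && next.gameFinished === true) {
    events.push('game.ended');
  }
  return events;
}
```

Comparing against the previous snapshot keeps a repeated entity update from broadcasting `game.started` or `game.ended` twice.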

Internal State

| Field | Type | Source |
| --- | --- | --- |
| playerCount | number | here map filtered for player roles |
| playerNames | string[] | here map player role name fields |
| lobbyState | string | room entity lobbyState |
| gameState | string | room entity state ("Lobby", "Gameplay") |
| gameStarted | boolean | Derived from state === "Gameplay" |
| gameFinished | boolean | room entity gameFinished |
| maxPlayers | number | REST response + room entity |
| secret / id | string / number | client/welcome, for reconnection |

Player Counting

The here map from client/welcome is the authoritative source. It lists all registered connections with their roles. Count entries where roles contains player. The shard itself is excluded (it has roles: {shard: {}}). The host (ID 1, roles: {host: {}}) is also excluded. Since Jackbox holds slots for disconnected players, here always reflects the true occupied slot count.

For subsequent joins after connect, textDescriptions entity updates provide join notifications. Since shards have here visibility, client/connected messages may also be delivered — both paths are handled, with here as source of truth.
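The counting rule can be sketched as a small pure function. The shape of `here` used below (an object keyed by connection ID, each entry carrying a `roles` object, with the player's name under `roles.player.name`) is an assumption inferred from the description above, and `summarizePlayers` is a hypothetical name.

```javascript
// Sketch only. Assumed shape of `here`:
//   { [connectionId]: { roles: { player: { name } } | { host: {} } | { shard: {} } } }
function summarizePlayers(here) {
  const playerNames = [];
  for (const entry of Object.values(here || {})) {
    const player = entry && entry.roles && entry.roles.player;
    // Host and shard entries have no `player` role, so they are skipped.
    if (player) playerNames.push(player.name);
  }
  return { playerCount: playerNames.length, playerNames };
}
```

Because the host and the shard are excluded by role rather than by ID, the function does not depend on the host always being ID 1.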

WebSocket Events (Game Picker → Connected Clients)

room.connected

Shard successfully connected to the Jackbox room. Sent once on initial connect. Replaces the old audience.joined event.

{
  "type": "room.connected",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "appTag": "drawful2international",
    "maxPlayers": 8,
    "playerCount": 2,
    "players": ["Alice", "Bob"],
    "lobbyState": "CanStart",
    "gameState": "Lobby"
  }
}

lobby.player-joined

A new player joined the lobby.

{
  "type": "lobby.player-joined",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "playerName": "Charlie",
    "playerCount": 3,
    "players": ["Alice", "Bob", "Charlie"],
    "maxPlayers": 8
  }
}

lobby.updated

Lobby state changed (enough players to start, countdown started, etc.).

{
  "type": "lobby.updated",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "lobbyState": "Countdown",
    "gameCanStart": true,
    "gameIsStarting": true,
    "playerCount": 4
  }
}

game.started

The game transitioned from Lobby to Gameplay.

{
  "type": "game.started",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "playerCount": 4,
    "players": ["Alice", "Bob", "Charlie", "Diana"],
    "maxPlayers": 8
  }
}

game.ended

The game finished.

{
  "type": "game.ended",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "playerCount": 4,
    "players": ["Alice", "Bob", "Charlie", "Diana"]
  }
}

room.disconnected

Shard lost connection to the Jackbox room.

{
  "type": "room.disconnected",
  "timestamp": "...",
  "data": {
    "sessionId": 1,
    "gameId": 5,
    "roomCode": "LSBN",
    "reason": "room_closed",
    "finalPlayerCount": 4
  }
}

Possible reason values: room_closed, room_not_found, connection_failed, role_rejected, manually_stopped.
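All of the events above share the same envelope (`type`, `timestamp`, `data`). The sketch below builds that envelope and rejects a `reason` outside the list just given. The helper names are hypothetical, and the timestamp is passed in by the caller because its format is not specified in this document.

```javascript
// Sketch only: helper names are hypothetical. The reason list is the one
// documented for room.disconnected.
const DISCONNECT_REASONS = new Set([
  'room_closed',
  'room_not_found',
  'connection_failed',
  'role_rejected',
  'manually_stopped',
]);

// Wraps a payload in the common { type, timestamp, data } envelope.
function buildEvent(type, data, timestamp) {
  return { type, timestamp, data };
}

// `base` carries sessionId, gameId and roomCode, as in the examples above.
function buildRoomDisconnected(base, reason, finalPlayerCount, timestamp) {
  if (!DISCONNECT_REASONS.has(reason)) {
    throw new RangeError(`Unknown disconnect reason: ${reason}`);
  }
  return buildEvent('room.disconnected', { ...base, reason, finalPlayerCount }, timestamp);
}
```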

Dropped Events

| Old event | Replacement |
| --- | --- |
| audience.joined | room.connected (richer payload) |
| player-count.updated (automated) | lobby.player-joined, game.started, game.ended carry playerCount |

The manual PATCH .../player-count endpoint keeps broadcasting player-count.updated for its specific use case.

DB Persistence

The session_games table columns player_count and player_count_check_status continue to be updated:

  • player_count: updated on each join and at game end
  • player_count_check_status: 'monitoring' (shard connected), 'completed' (game ended with count), 'failed' (couldn't connect), 'stopped' (manual stop)

The old 'checking' status becomes 'monitoring'.

Integration Points

Files Deleted

  • backend/utils/player-count-checker.js — Puppeteer audience approach
  • backend/utils/room-monitor.js — REST polling for lock state

Files Created

  • backend/utils/ecast-shard-client.js: EcastShardClient class + module exports (startMonitor, stopMonitor, cleanupAllShards)

Files Modified

backend/utils/jackbox-api.js — Add getRoomInfo(roomCode) returning the full room response including host, appTag, audienceEnabled.

backend/routes/sessions.js — Replace imports:

// Old
const { stopPlayerCountCheck } = require('../utils/player-count-checker');
const { startRoomMonitor, stopRoomMonitor } = require('../utils/room-monitor');

// New
const { startMonitor, stopMonitor } = require('../utils/ecast-shard-client');

All call sites change from two-function calls to one:

| Route | Old | New |
| --- | --- | --- |
| POST /:id/games (with room_code) | startRoomMonitor(...) | startMonitor(...) |
| PATCH .../status (away from playing) | stopRoomMonitor(...) + stopPlayerCountCheck(...) | stopMonitor(...) |
| DELETE .../games/:gameId | stopRoomMonitor(...) + stopPlayerCountCheck(...) | stopMonitor(...) |
| POST .../start-player-check | startRoomMonitor(...) | startMonitor(...) |
| POST .../stop-player-check | stopRoomMonitor(...) + stopPlayerCountCheck(...) | stopMonitor(...) |

Endpoint paths stay the same for backwards compatibility.

backend/server.js — Wire cleanupAllShards() into SIGTERM/SIGINT handlers.

Error Handling and Reconnection

Connection Failures

  1. REST validation fails (room not found, network error): Set status 'failed', broadcast room.disconnected with reason: 'room_not_found' or 'connection_failed'. No automatic retry.

  2. Shard WebSocket fails to connect: Retry up to 3 times with exponential backoff (2s, 4s, 8s). On exhaustion, set status 'failed', broadcast room.disconnected with reason: 'connection_failed'.

  3. Ecast rejects the shard role (error opcode received): Set status 'failed', broadcast room.disconnected with reason: 'role_rejected'. No retry.
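The retry rule in item 2 can be sketched as follows. This reads "retry up to 3 times" as one initial attempt plus three retries, which is an interpretation rather than something the document states. `connect` is a hypothetical async function that resolves with an open socket, and `sleep` is injectable so the schedule can be exercised without real delays.

```javascript
// Sketch only: `connectWithRetry`, `connect` and `sleep` are hypothetical.
const MAX_RETRIES = 3;
const BASE_DELAY_MS = 2000;

// Delay before retry N (1-based): 2000, 4000, 8000 ms.
function backoffDelayMs(retry) {
  return BASE_DELAY_MS * 2 ** (retry - 1);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function connectWithRetry(connect, sleep = defaultSleep) {
  let lastError;
  // Attempt 0 is the initial try; attempts 1..MAX_RETRIES are the retries.
  for (let retry = 0; retry <= MAX_RETRIES; retry += 1) {
    if (retry > 0) await sleep(backoffDelayMs(retry));
    try {
      return await connect();
    } catch (err) {
      lastError = err;
    }
  }
  // Retries exhausted: the caller sets status 'failed' and broadcasts
  // room.disconnected with reason 'connection_failed'.
  throw lastError;
}
```

A role rejection (item 3) should bypass this loop entirely, since the document specifies no retry for that case.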

Mid-Session Disconnections

  1. WebSocket closes unexpectedly: REST check GET /rooms/{code}:

    • Room exists → reconnect with stored secret/id (up to 3 attempts, exponential backoff). Transparent to clients on success.
    • Room gone → finalize with last known count, status 'completed', broadcast game.ended + room.disconnected.
  2. Ecast error 2027 "room already closed": Same as room-gone path.
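The two mid-session paths reduce to one decision: finalize or reconnect. A minimal sketch, assuming the caller has already run the REST check and passes its result as `roomExists`; the function name and the shape of the returned object are hypothetical.

```javascript
// Sketch only: `decideAfterDisconnect` and its return shape are hypothetical.
const ROOM_ALREADY_CLOSED = 2027; // Ecast error "room already closed"

function decideAfterDisconnect({ errorCode, roomExists }) {
  if (errorCode === ROOM_ALREADY_CLOSED || !roomExists) {
    // Room-gone path: keep the last known count and close out the game.
    return {
      action: 'finalize',
      status: 'completed',
      broadcast: ['game.ended', 'room.disconnected'],
    };
  }
  // Room still exists: reconnect with the stored secret/id, nothing broadcast.
  return { action: 'reconnect', status: 'monitoring', broadcast: [] };
}
```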

Manual Stop

  1. stop-player-check called or game status changes: Close WebSocket gracefully, set status 'stopped' (unless already 'completed'), broadcast room.disconnected with reason: 'manually_stopped'.

Server Shutdown

  1. SIGTERM/SIGINT: cleanupAllShards() closes all WebSocket connections. No DB updates on shutdown.
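One possible shape for the shutdown wiring is sketched below. The registry `activeShards`, its keys, and `registerShutdownHandlers` are hypothetical; only `cleanupAllShards` is named in this document. The handler deliberately does nothing beyond closing sockets, on the assumption that the existing handlers in server.js still close the HTTP server and exit.

```javascript
// Sketch only: `activeShards` is a hypothetical registry of connected clients,
// each exposing a close() method.
const activeShards = new Map();

function cleanupAllShards() {
  for (const [key, client] of activeShards) {
    try {
      client.close();
    } catch (err) {
      // Keep going so one bad socket does not block the rest.
    }
    activeShards.delete(key);
  }
}

// `proc` defaults to the real process; an EventEmitter can stand in for tests.
function registerShutdownHandlers(proc = process) {
  for (const signal of ['SIGTERM', 'SIGINT']) {
    proc.once(signal, () => {
      cleanupAllShards(); // no DB updates on shutdown
    });
  }
}
```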

State Machine

                    startMonitor()
                         │
                         ▼
                  ┌───────────┐
         ┌────────│not_started│
         │        └───────────┘
         │              │
    REST fails     REST succeeds
         │              │
         ▼              ▼
    ┌────────┐    ┌────────────┐
    │ failed │    │ monitoring │◄──── reconnect success
    └────────┘    └─────┬──────┘
         ▲              │
         │         ┌────┴─────┬──────────────┐
    reconnect      │          │              │
    exhausted   game ends   WS drops      manual stop
         │         │          │              │
         │         ▼          ▼              ▼
         │   ┌──────────┐  REST check  ┌─────────┐
         │   │completed │     │        │ stopped │
         │   └──────────┘     │        └─────────┘
         │                    │
         └──── room gone? ────┘
                   │
              room exists?
                   │
              reconnect...
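The diagram can also be read as a transition table. In the sketch below the status names come from the diagram, the event names are hypothetical labels for its arrows, and `room_gone` leads to `completed` as described under Mid-Session Disconnections.

```javascript
// Sketch only: event names are hypothetical labels for the diagram's arrows.
const TRANSITIONS = {
  not_started: {
    rest_failed: 'failed',
    rest_succeeded: 'monitoring',
  },
  monitoring: {
    game_ended: 'completed',
    room_gone: 'completed',
    reconnect_succeeded: 'monitoring',
    reconnect_exhausted: 'failed',
    manual_stop: 'stopped',
  },
};

// Terminal statuses ('completed', 'failed', 'stopped') have no outgoing
// transitions, so any event leaves them unchanged.
function nextStatus(current, event) {
  const row = TRANSITIONS[current];
  return (row && row[event]) || current;
}
```

Leaving terminal statuses unchanged covers the Manual Stop rule that a stop request does not overwrite 'completed'.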

Timeouts

| Concern | Value | Rationale |
| --- | --- | --- |
| WebSocket connect timeout | 10s | Ecast servers respond fast |
| Reconnect backoff | 2s, 4s, 8s | Three attempts, ~14s total |
| Max reconnect attempts | 3 | Fail fast, user can retry manually |
| WebSocket inactivity timeout | None | Shard connections receive periodic shard/sync CRDT messages |

Dependencies

Added: none. ws (Node.js WebSocket library) is already a dependency (used by websocket-manager.js).

Removed: puppeteer — no longer needed for room monitoring.

Non-Goals

  • Renaming REST endpoint paths (start-player-check / stop-player-check) — kept for backwards compatibility
  • Auto-starting monitoring when room code is set via PATCH .../room-code — kept as manual trigger only
  • Frontend Picker.jsx changes — tracked separately (existing bugs: message.event vs message.type, subscribe without auth, 'waiting' status that's never set)