docs/WEBSOCKET_403_ANALYSIS.md

# WebSocket 403 Error Analysis

**Date**: October 31, 2025  
**Issue**: Direct WebSocket connection to `wss://engine.kosmi.io/gql-ws` returns 403 Forbidden

## Tests Performed

### Test 1: No Authentication
```bash
./test-websocket -mode 2
```
**Result**: 403 Forbidden ❌

### Test 2: Origin Header Only
```bash
./test-websocket -mode 3
```
**Result**: 403 Forbidden ❌

### Test 3: With JWT Token
```bash
./test-websocket-direct -token <CAPTURED_TOKEN>
```
**Result**: 403 Forbidden ❌

### Test 4: With Session Cookies + Token
```bash
./test-session -room <URL> -token <TOKEN>
```
**Result**: 403 Forbidden ❌  
**Note**: No cookies were set by visiting the room page

## Analysis

### Why 403?

The 403 is returned during the **WebSocket handshake** (the HTTP upgrade request), BEFORE we can send the `connection_init` message with the JWT token. This means:

1. ❌ It's NOT about the JWT token in `connection_init` (that's sent after the connection is established)
2. ❌ It's NOT about cookies (no cookies are set)
3. ❌ It's unlikely to be the Origin header (Test 2 sent the expected origin and still got 403)
4. ✅ It's likely a security measure at the WebSocket server or proxy level

### Possible Causes

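1. **CDN/Proxy Bot Protection**
   - Response headers show `Server: Cowboy` with `Via: 1.1 Caddy`, i.e. a Cowboy (Erlang) server behind a Caddy reverse proxy; nothing in these headers points to Cloudflare specifically
   - The proxy may have bot protection that detects non-browser clients
   - May require a JavaScript challenge or proof-of-work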

2. **TLS Fingerprinting**
   - Server may be checking the TLS client hello fingerprint
   - Go's TLS implementation has a different fingerprint than browsers
   - This is commonly used to block bots

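3. **WebSocket Sub-protocol or Header Validation**
   - Server may reject handshakes whose `Sec-WebSocket-Protocol` (sub-protocol) or `Sec-WebSocket-Extensions` headers are missing or unexpected
   - Browser sends additional headers that we're not replicating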

4. **IP-based Rate Limiting**
   - Previous requests from the same IP may have triggered protection
   - Would only explain why the browser works but our client doesn't if the two connect from different IPs

### Evidence from ChromeDP

ChromeDP **DOES work** because:
- It's literally a real Chrome browser
- Has the correct TLS fingerprint
- Passes all JavaScript challenges
- Has complete browser context

## Proposed Solution

### Hybrid Approach: ChromeDP for Token, Native for WebSocket

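Since:
1. JWT tokens are valid for **1 year**
2. ChromeDP successfully obtains tokens
3. A native WebSocket client cannot get past the 403 on its own

**Solution**: Use ChromeDP to get the token once, cache it, and reuse it for native WebSocket connections. This only pays off if a native connection with a valid token gets past the handshake, which Test 3 did not. The cache could look like this: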

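```go
import (
    "sync"
    "time"
)

// TokenCache holds the JWT obtained via ChromeDP until shortly before it expires.
type TokenCache struct {
    mu         sync.RWMutex
    token      string
    expiration time.Time
}

// Get returns the cached token, refreshing it via ChromeDP when missing or expired.
func (c *TokenCache) Get() (string, error) {
    c.mu.RLock()
    token, expiration := c.token, c.expiration
    // Release the read lock before refreshing: sync.RWMutex is not reentrant,
    // so taking the write lock while still holding it would deadlock.
    c.mu.RUnlock()

    if token != "" && time.Now().Before(expiration) {
        return token, nil // Use cached token
    }

    // Token expired or missing, get new one via ChromeDP
    return c.refreshToken()
}

func (c *TokenCache) refreshToken() (string, error) {
    c.mu.Lock()
    defer c.mu.Unlock()

    // Another goroutine may have refreshed the token while we waited for the lock.
    if c.token != "" && time.Now().Before(c.expiration) {
        return c.token, nil
    }

    // Launch ChromeDP, visit room, extract token.
    // extractTokenViaChromeDPOnce is defined elsewhere; it is assumed here to
    // return (string, error).
    token, err := extractTokenViaChromeDPOnce()
    if err != nil {
        return "", err
    }

    // Cache for 11 months (give 1 month buffer)
    c.token = token
    c.expiration = time.Now().Add(11 * 30 * 24 * time.Hour)

    return token, nil
}
```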

**Benefits**:
- ✅ Only need ChromeDP roughly once every 11 months (the cache is in-memory, so it must be persisted to survive restarts)
- ✅ Native WebSocket for all subsequent connections
- ✅ Lightweight after initial token acquisition
- ✅ Automatic token refresh when expired

## Alternative: Keep ChromeDP

If we can't bypass the 403, we should optimize the ChromeDP approach instead:

1. **Reduce Memory Usage**
   - Use headless-shell instead of full Chrome (~50MB vs ~200MB)
   - Disable unnecessary Chrome features
   - Clean up resources aggressively

2. **Reduce Startup Time**
   - Keep the Chrome instance alive across bridge restarts
   - Attach to it through Chrome's remote debugging port instead of launching a new instance

3. **Accept the Trade-off**
   - 200MB RAM is acceptable for a relay service
   - 3-5 second startup is one-time cost
   - It's the most reliable solution
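
For the "reduce startup time" item, a Chrome started with `--remote-debugging-port` exposes its DevTools WebSocket URL at `/json/version`. The following sketch looks that URL up with the standard library only; the helper names and the base address are assumptions for illustration.

```go
package main

import (
    "encoding/json"
    "errors"
    "io"
    "net/http"
)

// parseDebuggerURL extracts webSocketDebuggerUrl from the JSON that Chrome
// serves at /json/version.
func parseDebuggerURL(body []byte) (string, error) {
    var v struct {
        WebSocketDebuggerURL string `json:"webSocketDebuggerUrl"`
    }
    if err := json.Unmarshal(body, &v); err != nil {
        return "", err
    }
    if v.WebSocketDebuggerURL == "" {
        return "", errors.New("no webSocketDebuggerUrl in response")
    }
    return v.WebSocketDebuggerURL, nil
}

// debuggerURL asks an already-running Chrome for its DevTools WebSocket URL.
// base is the debugging endpoint, e.g. "http://127.0.0.1:9222" (assumed port).
func debuggerURL(base string) (string, error) {
    resp, err := http.Get(base + "/json/version")
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    return parseDebuggerURL(body)
}
```

The returned URL is what `chromedp.NewRemoteAllocator` takes, so the bridge can attach to the existing browser instead of paying the 3-5 second launch cost again. If the lookup fails, falling back to launching a fresh instance keeps the behaviour the same as today.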

## Next Steps

### Option A: Continue Investigation
- [ ] Try different TLS libraries (crypto/tls alternatives)
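- [ ] Analyze browser's exact WebSocket handshake with Wireshark (`wss://` is TLS-encrypted, so this needs TLS key logging, e.g. `SSLKEYLOGFILE`) or the browser's DevTools Network tab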
- [ ] Try mimicking browser's TLS fingerprint
- [ ] Test from different IP addresses

### Option B: Implement Hybrid Solution
- [ ] Extract token from ChromeDP session
- [ ] Implement token caching with expiration
- [ ] Try native WebSocket with cached token
- [ ] Verify if 403 still occurs

### Option C: Optimize ChromeDP
- [ ] Switch to chromedp/headless-shell
- [ ] Implement Chrome instance pooling
- [ ] Optimize memory usage
- [ ] Document performance characteristics

## Recommendation

**Go with Option C**: Optimize ChromeDP

**Reasoning**:
1. ChromeDP is the only approach confirmed to work
2. Token caching won't help if the native WebSocket handshake still returns 403
3. The 403 is likely to persist without a real browser context
4. Optimization can make ChromeDP acceptable for production
5. ~100MB RAM for a bridge service is reasonable (an estimate, between the ~50MB and ~200MB figures above)

**Implementation**:
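```go
import (
    "context"

    "github.com/chromedp/chromedp"
)

// Container image: build on chromedp/headless-shell, i.e. in the Dockerfile:
//   FROM chromedp/headless-shell:latest

// newAllocator (hypothetical name) starts Chrome with flags that reduce memory usage.
func newAllocator(ctx context.Context) (context.Context, context.CancelFunc) {
    opts := append(chromedp.DefaultExecAllocatorOptions[:],
        chromedp.Flag("disable-gpu", true),
        chromedp.Flag("disable-dev-shm-usage", true),
        chromedp.Flag("single-process", true), // Reduce memory
        chromedp.Flag("no-zygote", true),      // Reduce memory
    )
    return chromedp.NewExecAllocator(ctx, opts...)
}

// KeepAlive keeps the Chrome instance alive (Bkosmi is defined elsewhere).
func (b *Bkosmi) KeepAlive() {
    // Don't close Chrome between messages
    // Only restart if crashed
}
```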

## Conclusion

The 403 Forbidden error is likely a security measure that cannot be easily bypassed without a real browser context. The most pragmatic solution is to **optimize and embrace the ChromeDP approach** rather than trying to reverse engineer the security mechanism.

**Status**: ChromeDP remains the recommended implementation ✅