朱潮 ac8782e1a7 docs(solutions): add SSE error handling solution documentation

Add comprehensive documentation for the SSE stream termination fix:
- Problem analysis and root cause
- Step-by-step solution with code examples
- Security considerations (__debug__ vulnerability)
- Code simplification recommendations
- Prevention strategies and best practices
- Testing and monitoring guidelines

Location: docs/solutions/runtime-errors/sse-mcp-tool-error-handling.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2026-01-07 20:25:50 +08:00

Fix RemoteProtocolError in SSE Streams When MCP Tool Calls Fail

Problem Symptom

When MCP (Model Context Protocol) tool calls failed during agent execution:

The exception was caught and logged in agent_task(), but no error message was sent to the client over the SSE (Server-Sent Events) stream
The [DONE] marker was missing from the outer exception handler in enhanced_generate_stream_response()
The client surfaced this as RemoteProtocolError: incomplete chunked read, because the response ended while it was still expecting more data
Client connections would time out or drop unexpectedly

Error Messages

httpx.RemoteProtocolError: incomplete chunked read

Observable Behavior

The SSE stream would terminate prematurely without sending a completion marker
Client-side: IncompleteRead errors or connection timeouts
Server logs showed the error was caught, but the client received no notification

Root Cause Analysis

The issue occurred in the async SSE streaming architecture where:

Inner exception handler (agent_task() in routes/chat.py:116-118):
- Caught exceptions but only logged them
- Did not send error response to client via SSE
- Only sent agent_done signal to output queue
Outer exception handler (routes/chat.py:191-204):
- Caught exceptions in the main stream generator
- Sent error data but omitted the [DONE] marker
- Left client connection hanging without proper stream termination
MCP tool loading (agent/deep_assistant.py:102-108):
- Used generic logger.info for errors (should be logger.error)
- No traceback for debugging
- No graceful handling at init level

Investigation Steps Tried

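Reviewed the streaming protocol: Confirmed that clients of this OpenAI-style API expect the stream to end with a [DONE] marker (a convention layered on top of SSE, not part of the SSE specification itself)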
Analyzed error flow: Traced exception from agent_task() through output queue to client
Examined similar code: Checked other streaming endpoints for proper patterns
Tested MCP failure scenarios: Simulated tool loading failures to reproduce

Working Solution

1. Send Structured Error Messages in Inner Exception Handler

File: routes/chat.py (lines 116-133)

Before:

except Exception as e:
    logger.error(f"Error in agent task: {e}")
    await output_queue.put(("agent_done", None))

After:

except Exception as e:
    import traceback
    error_details = traceback.format_exc()
    logger.error(f"Error in agent task: {str(e)}")
    logger.error(f"Full traceback: {error_details}")

    # Send error message to client via SSE stream
    error_data = {
        "error": {
            "message": f"Agent execution failed: {str(e)}",
            "type": "agent_error",
            "details": error_details if __debug__ else str(e)
        }
    }
    error_chunk = create_stream_chunk(
        "chatcmpl-error",
        config.model_name,
        json.dumps(error_data, ensure_ascii=False)
    )
    await output_queue.put(("agent", f"data: {json.dumps(error_chunk, ensure_ascii=False)}\n\n"))
    # Send completion signal to ensure output controller can terminate properly
    await output_queue.put(("agent_done", None))

Why this works:

Client receives structured error response before stream terminates
Error follows same chunk format as normal responses for consistent parsing
Includes traceback in debug mode for troubleshooting
agent_done signal ensures output queue consumer can exit
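
The producer/consumer flow described above can be sketched in isolation. This is a minimal illustration, not the project's code: make_chunk is a hypothetical stand-in for create_stream_chunk (whose real output shape is not shown in this document), and the sketch places the agent_done signal in a finally block, which is a variation on the handler above rather than what the commit does.

```python
import asyncio
import json
import logging
import traceback
from typing import List

logger = logging.getLogger("sse_sketch")


def make_chunk(chunk_id: str, model: str, content: str) -> dict:
    """Hypothetical stand-in for create_stream_chunk; the real shape may differ."""
    return {"id": chunk_id, "model": model, "choices": [{"delta": {"content": content}}]}


async def agent_task(output_queue: asyncio.Queue, fail: bool) -> None:
    """Producer: forwards a normal chunk or a structured error, then always signals done."""
    try:
        if fail:
            raise RuntimeError("MCP tool call failed")
        chunk = make_chunk("chatcmpl-1", "demo-model", "hello")
        await output_queue.put(("agent", f"data: {json.dumps(chunk)}\n\n"))
    except Exception as e:
        logger.error("Error in agent task: %s\n%s", e, traceback.format_exc())
        error_data = {"error": {"message": f"Agent execution failed: {e}", "type": "agent_error"}}
        chunk = make_chunk("chatcmpl-error", "demo-model", json.dumps(error_data))
        await output_queue.put(("agent", f"data: {json.dumps(chunk)}\n\n"))
    finally:
        # The consumer relies on this signal to stop waiting on the queue.
        await output_queue.put(("agent_done", None))


async def drain(fail: bool) -> List[str]:
    """Consumer: collects frames until the agent_done signal arrives."""
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(agent_task(queue, fail))
    frames = []
    while True:
        kind, payload = await queue.get()
        if kind == "agent_done":
            break
        frames.append(payload)
    await task
    return frames
```

Calling asyncio.run(drain(fail=True)) returns a single frame whose content field carries the serialized error object, so the client can parse it with the same code path it uses for normal chunks.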

2. Add [DONE] Marker to Outer Exception Handler

File: routes/chat.py (line 204)

Before:

except Exception as e:
    # ... error handling ...
    yield f"data: {json.dumps(error_data, ensure_ascii=False)}\n\n"
    # Missing: yield "data: [DONE]\n\n"

After:

except Exception as e:
    # ... error handling ...
    yield f"data: {json.dumps(error_data, ensure_ascii=False)}\n\n"
    yield "data: [DONE]\n\n"  # Ensure proper stream termination

Why this works:

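The OpenAI-style streaming convention used here relies on [DONE] to signal the end of the stream (the SSE specification itself defines no end marker)
Without it, a client following that convention keeps waiting for more chunks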
Prevents RemoteProtocolError: incomplete chunked read
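
As a rough sketch of the same idea outside the project, the generator below emits [DONE] on both the success path and the error path. The names failing_source, stream_response, and collect are hypothetical and exist only for this illustration.

```python
import asyncio
import json
from typing import AsyncIterator, List


async def failing_source() -> AsyncIterator[str]:
    """Hypothetical upstream that yields one chunk and then fails."""
    yield "partial answer"
    raise RuntimeError("upstream failure")


async def stream_response(source: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap a source so that every exit path ends with the [DONE] sentinel."""
    try:
        async for text in source:
            payload = json.dumps({"content": text}, ensure_ascii=False)
            yield f"data: {payload}\n\n"
    except Exception as e:
        error_data = {"error": {"message": str(e), "type": "stream_error"}}
        payload = json.dumps(error_data, ensure_ascii=False)
        yield f"data: {payload}\n\n"
    # Reached after normal completion and after a handled error.
    yield "data: [DONE]\n\n"


async def collect(source: AsyncIterator[str]) -> List[str]:
    """Helper for tests: gather every frame the wrapper emits."""
    return [frame async for frame in stream_response(source)]
```

Placing the final yield after the try/except, rather than duplicating it in each branch, makes it harder for a future handler to forget the marker.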

3. Enhance MCP Tool Loading Error Handling

File: agent/deep_assistant.py (lines 102-108)

Before:

except Exception as e:
    # Log at info level, return empty list
    logger.info(f"get_tools_from_mcp: error {e}, elapsed: {time.time() - start_time:.3f}s")
    return []

After:

except Exception as e:
    import traceback
    error_details = traceback.format_exc()
    # Log at ERROR level with full traceback
    logger.error(f"get_tools_from_mcp: error {str(e)}, elapsed: {time.time() - start_time:.3f}s")
    logger.error(f"Full traceback: {error_details}")
    return []

File: agent/deep_assistant.py (lines 142-148)

Added:

try:
    mcp_tools = await get_tools_from_mcp(mcp_settings)
    logger.info(f"Successfully loaded {len(mcp_tools)} MCP tools")
except Exception as e:
    logger.error(f"Failed to load MCP tools: {str(e)}, using empty tool list")
    mcp_tools = []

Why this works:

Prevents cascading failures when MCP tools fail to load
Agent can continue without tools rather than crashing
Better logging aids in debugging root cause
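
The degradation pattern can be exercised without a real MCP server. The sketch below assumes a hypothetical load_tools_or_empty wrapper and a fake loader; it uses logger.exception, which logs at ERROR level and appends the active traceback, as a compact alternative to calling traceback.format_exc() by hand.

```python
import asyncio
import logging
import time

logger = logging.getLogger("mcp_sketch")


async def load_tools_or_empty(loader, settings) -> list:
    """Await loader(settings); on any failure, log the traceback and return []."""
    start = time.time()
    try:
        tools = await loader(settings)
        logger.info("Successfully loaded %d MCP tools", len(tools))
        return tools
    except Exception:
        # logger.exception records the message at ERROR level plus the traceback.
        logger.exception(
            "MCP tool loading failed after %.3fs, using empty tool list",
            time.time() - start,
        )
        return []


async def broken_loader(settings):
    """Hypothetical loader that simulates an unreachable MCP server."""
    raise ConnectionError("MCP server unreachable")
```

With this wrapper, asyncio.run(load_tools_or_empty(broken_loader, {})) returns an empty list while the failure and its traceback still reach the log.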

Security Considerations

CRITICAL: __debug__ Flag Issue

SECURITY VULNERABILITY: The current implementation gates traceback output on __debug__, which is True in any normal Python process. It becomes False only when the interpreter is started with -O or -OO, or when PYTHONOPTIMIZE is set.

# CURRENT - VULNERABLE IN PRODUCTION
"details": error_details if __debug__ else str(e)

Impact: Full tracebacks are sent to clients and may contain:

File paths revealing server directory structure
Library versions and dependency information
Internal variable names and state
Database connection strings (if present in stack trace)
API keys or secrets if they appear in an exception message

Recommended Fix:

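# In utils/settings.py
import os

# With this default, an unset ENVIRONMENT enables debug output,
# so production deployments must set ENVIRONMENT=production explicitly.
ENVIRONMENT = os.getenv("ENVIRONMENT", "development")
DEBUG_MODE = ENVIRONMENT != "production"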

# In routes/chat.py
from utils.settings import DEBUG_MODE

# Secure implementation
"details": error_details if DEBUG_MODE else "An internal error occurred"

Verification:

# Test current behavior
poetry run python -c "print(__debug__)"
# Output: True, including in production, unless Python is started with -O
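
The behaviour of __debug__ can also be checked in-process. The sketch below uses the optimize argument of the built-in compile() to evaluate __debug__ at a given optimization level, and pairs it with an explicit environment check. Both helper names are hypothetical, and unlike the settings snippet above, is_debug_mode treats an unset ENVIRONMENT as production; that default is an assumption of this sketch.

```python
import os
from typing import Mapping, Optional


def debug_constant(optimize: int) -> bool:
    """Evaluate __debug__ as it is compiled at the given optimization level."""
    namespace: dict = {}
    code = compile("value = __debug__", "<debug-check>", "exec", optimize=optimize)
    exec(code, namespace)
    return namespace["value"]


def is_debug_mode(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Explicit opt-in: debug output only for a non-production ENVIRONMENT value."""
    env = os.environ if environ is None else environ
    # Assumption of this sketch: a missing variable is treated as production.
    return env.get("ENVIRONMENT", "production") != "production"
```

debug_constant(0) corresponds to a normal interpreter and debug_constant(1) to one started with -O, which illustrates why __debug__ says nothing about the deployment environment.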

Additional Security Recommendations

Sanitize error messages - Remove sensitive data (passwords, API keys, file paths) before sending to clients
Implement environment-based configuration - Use ENVIRONMENT variable to distinguish production from development
Log security-sensitive events - Track attempted exploits, excessive errors, unusual patterns
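
One possible shape for the sanitization step is sketched below. The function name and the two regular expressions are illustrative assumptions, not an exhaustive filter; a real deployment would need patterns tuned to the secrets and paths it actually handles.

```python
import re

# Illustrative patterns only: key=value style credentials and absolute-looking paths.
_REDACTIONS = [
    (
        re.compile(r"(?i)\b(password|passwd|api[_-]?key|token|secret)\b(\s*[=:]\s*)\S+"),
        r"\1\2[REDACTED]",
    ),
    (re.compile(r"(?:/[\w.\-]+){2,}"), "[PATH]"),
]


def sanitize_error_message(message: str, max_length: int = 200) -> str:
    """Redact obvious secrets and file paths, then cap the message length."""
    for pattern, replacement in _REDACTIONS:
        message = pattern.sub(replacement, message)
    return message[:max_length]
```

Applying the function to str(e) before it is placed in the error payload keeps the client-facing message useful without echoing server internals.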

Code Simplification Opportunities

Identified Over-Engineering

Nested error structure (lines 85-96): Error type categorization provides no value to clients
- Recommendation: Simplify to {"error": str(e)}
Redundant exception handling (agent/deep_assistant.py:145-150): Wrapper try/except duplicates inner handler logic
- Recommendation: Remove wrapper, let inner handler manage errors
Double JSON serialization: Creates chunk then serializes again for queue
- Recommendation: Direct JSON string construction

Minimal Implementation

# Minimal functional implementation (~15 lines vs ~30 lines)
except Exception as e:
    logger.error(f"Agent error: {e}")
    # Serialize with json.dumps so quotes and newlines in the message are escaped
    await output_queue.put(("agent", f"data: {json.dumps({'error': str(e)})}\n\n"))
    await output_queue.put(("agent_done", None))

Trade-off: Minimal implementation is simpler but provides less debugging context. Choose based on team needs.
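
Whichever variant is chosen, the payload is safer when built with json.dumps than by string interpolation: an exception message may contain quotes or newlines, and a raw newline inside an SSE data line would split the event. A small sketch, with error_frame as a hypothetical helper name:

```python
import json


def error_frame(exc: Exception) -> str:
    """Build one SSE frame whose data line is a valid, single-line JSON object."""
    payload = json.dumps({"error": str(exc)}, ensure_ascii=False)
    return f"data: {payload}\n\n"
```

Because json.dumps escapes the newline as the two characters \n, the only real line breaks in the frame are the two that terminate the event.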

Prevention Strategies

Best Practices for SSE Stream Error Handling

Always send error responses before termination

except Exception as e:
    error_chunk = create_error_chunk(e)
    yield f"data: {error_chunk}\n\n"
    yield "data: [DONE]\n\n"

Use structured error responses

error_data = {
    "error": {
        "message": str(e),
        "type": "error_type",
        "details": traceback.format_exc() if DEBUG_MODE else None
    }
}

Log at appropriate levels
- Use logger.error() for exceptions (not logger.info())
- Include tracebacks for debugging
- Add context (elapsed time, component name)
Ensure graceful degradation
- Return empty lists instead of crashing
- Allow system to continue with reduced functionality
- Document what happens when components fail
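
The create_error_chunk helper used in the first practice is not defined in this document. One way it might look is sketched below, under the assumption that it returns a JSON string and takes an explicit debug flag; the signature is hypothetical.

```python
import json
import traceback
from typing import Optional


def create_error_chunk(
    exc: Exception, error_type: str = "agent_error", debug: bool = False
) -> str:
    """Serialize an exception into the structured error shape shown above."""
    details: Optional[str] = None
    if debug:
        # format_exception works from the exception object, so it can be
        # called outside the except block that caught it.
        details = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    error_data = {
        "error": {"message": str(exc), "type": error_type, "details": details}
    }
    return json.dumps(error_data, ensure_ascii=False)
```

Passing the debug flag in explicitly keeps the helper free of global state and makes the production and development behaviours easy to test side by side.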

Code Review Checklist for SSE/Streaming Code

Every exception handler sends error response to client
Every exception handler yields [DONE] marker
Error responses follow same format as success responses
Tracebacks logged for debugging (in non-production paths)
Output queues properly signaled on error (*_done events such as agent_done)
Async tasks properly cleaned up on exception
File descriptors/connections properly closed
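
The task cleanup item can be illustrated with a generator that cancels its producer when the consumer goes away. This is a sketch under assumptions: stream_with_cleanup and slow_producer are hypothetical names, and the queue protocol mirrors the ("agent", payload) / ("agent_done", None) tuples used earlier.

```python
import asyncio
import contextlib
from typing import AsyncIterator


async def stream_with_cleanup(
    queue: asyncio.Queue, task: asyncio.Task
) -> AsyncIterator[str]:
    """Relay frames from the queue and cancel the producer if the stream is abandoned."""
    try:
        while True:
            kind, payload = await queue.get()
            if kind == "agent_done":
                break
            yield payload
        yield "data: [DONE]\n\n"
    finally:
        # Runs on normal completion and when the generator is closed early.
        if not task.done():
            task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task


async def slow_producer(queue: asyncio.Queue) -> None:
    """Hypothetical producer that emits one frame and then hangs."""
    await queue.put(("agent", "data: first\n\n"))
    await asyncio.sleep(3600)


async def demo() -> bool:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(slow_producer(queue))
    stream = stream_with_cleanup(queue, task)
    first = await stream.__anext__()
    await stream.aclose()  # comparable to the client disconnecting mid-stream
    return first == "data: first\n\n" and task.cancelled()
```

Only CancelledError is suppressed in the finally block, so an unexpected producer exception would still surface during cleanup instead of being silently dropped.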

Testing Strategy

Test Cases:

Normal flow: Verify SSE stream ends with [DONE]
MCP tool failure: Simulate tool loading failure, verify graceful handling
Agent execution error: Inject exception during astream(), verify error response
Network timeout: Test slow/failed MCP servers
Concurrent requests: Ensure error handling works under load

Test Commands:

# Test with invalid MCP configuration
curl -X POST http://localhost:8001/api/v2/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "test"}],
    "stream": true,
    "mcp_settings": {"invalid": "config"}
  }'

# Monitor for [DONE] marker in response
# Check server logs for proper error logging
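
The manual checks above can be automated with a small parser over a captured response body. The sketch below is a simplification: it assumes events are separated by a blank line with \n line endings, and it only detects error objects at the top level of a frame (errors wrapped inside a chunk's content field would need an extra decoding step). Both function names are hypothetical.

```python
import json
from typing import List


def parse_sse_data(raw: str) -> List[str]:
    """Return the data payload of each event in a raw SSE response body."""
    events = []
    for block in raw.split("\n\n"):
        lines = [
            line[len("data:"):].strip()
            for line in block.splitlines()
            if line.startswith("data:")
        ]
        if lines:
            events.append("\n".join(lines))
    return events


def check_stream(raw: str) -> dict:
    """Report whether the stream ended with [DONE] and which errors it carried."""
    events = parse_sse_data(raw)
    errors = []
    for event in events:
        if event == "[DONE]":
            continue
        try:
            parsed = json.loads(event)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and "error" in parsed:
            errors.append(parsed["error"])
    return {"terminated": bool(events) and events[-1] == "[DONE]", "errors": errors}
```

Feeding the output of the curl command into check_stream gives a pass/fail signal for stream termination that can be asserted in an integration test.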

Monitoring Guidelines

Key Metrics:

SSE streams that end without [DONE]
Rate of RemoteProtocolError exceptions
MCP tool loading failures
Agent execution errors by type

Log Patterns:

ERROR:Error in agent task: <error details>
ERROR:Full traceback: <traceback>
ERROR:get_tools_from_mcp: error <error>
ERROR:Failed to load MCP tools: <error>

Alerting:

Alert on spike in RemoteProtocolError
Monitor for streams ending without [DONE]
Track MCP tool loading success rate
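
For the first metric, one option is to wrap the response generator and record how each stream ended. The sketch below uses an in-process Counter as a placeholder for whatever metrics backend is in use; monitored and stream_outcomes are hypothetical names.

```python
import asyncio
import logging
from collections import Counter
from typing import AsyncIterator, Optional

logger = logging.getLogger("sse_monitor")
stream_outcomes: Counter = Counter()  # placeholder for a real metrics client


async def monitored(stream: AsyncIterator[str]) -> AsyncIterator[str]:
    """Pass frames through unchanged and count streams that end without [DONE]."""
    last: Optional[str] = None
    try:
        async for frame in stream:
            last = frame
            yield frame
    finally:
        if last is not None and last.strip() == "data: [DONE]":
            stream_outcomes["terminated"] += 1
        else:
            stream_outcomes["missing_done"] += 1
            logger.warning("SSE stream ended without [DONE]")
```

Because the check lives in a finally block, streams that end through an unhandled exception or an early client disconnect are counted as missing_done as well.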

Related Documentation

SSE Specification
FastAPI StreamingResponse docs
MCP integration docs (internal)

Cross-References

Commit: 8a85e9025e183e80f4370d4251ca0d5af6203a41
Files: routes/chat.py, agent/deep_assistant.py
Issue: N/A (internal fix)

Verification

To verify the fix works:

Check logs for proper error handling:

tail -f logs/app.log | grep -E "(Error in agent task|Full traceback)"

Test SSE stream termination:

curl -N http://localhost:8001/api/v2/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"stream": true, ...}' | grep "\[DONE\]"

Verify error response format:
- Error should have error.type: "agent_error"
- Error should include message and details
- Stream should end with data: [DONE]
