low-code-react-app-b/docs/deployment/BAD_GATEWAY_FIX.md
2026-01-17 00:33:02 +00:00

# Bad Gateway Errors - Fixed
## Problem
The application was returning a large number of "Bad Gateway" (502) errors, caused by excessive LLM API calls.
## Root Causes Identified
1. **Auto-scanning running every 2 seconds** - The `useAutoRepair` hook was automatically scanning all files for errors every 2 seconds, making continuous LLM calls
2. **No rate limiting** - Multiple AI features (component generation, code improvement, error repair, etc.) were making unlimited concurrent LLM requests
3. **No error circuit breaker** - Failed requests would retry immediately without backing off
4. **No request throttling** - All AI operations competed for the same gateway resources
## Solutions Implemented
### 1. Rate Limiting System (`src/lib/rate-limiter.ts`)
- **Per-category rate limiting**: Different limits for different AI operations
- **Time windows**: Tracks requests over rolling 60-second windows
- **Automatic cleanup**: Removes stale tracking data
- **Priority queue support**: High-priority requests can retry with backoff
- **Status tracking**: Monitor remaining capacity and reset times

Configuration:
- **AI Operations**: Max 3 requests per minute
- **Error Scanning**: Max 1 request per 30 seconds
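
The per-category limits above could be enforced with a sliding-window limiter along these lines. This is a minimal sketch, not the contents of `src/lib/rate-limiter.ts`: the class name `SlidingWindowLimiter`, its methods, and the category keys `ai-operations` and `error-scanning` are assumptions made for illustration.

```javascript
// Hypothetical sketch of a per-category sliding-window rate limiter.
class SlidingWindowLimiter {
  constructor(limits, now = () => Date.now()) {
    this.limits = limits;         // { category: { max, windowMs } }
    this.now = now;               // injectable clock, useful in tests
    this.timestamps = new Map();  // category -> array of request times
  }

  // Drop timestamps that have fallen out of the category's window.
  prune(category) {
    const cutoff = this.now() - this.limits[category].windowMs;
    const recent = (this.timestamps.get(category) || []).filter((t) => t > cutoff);
    this.timestamps.set(category, recent);
    return recent;
  }

  // Returns true and records the request if the category has capacity left.
  tryAcquire(category) {
    const recent = this.prune(category);
    if (recent.length >= this.limits[category].max) return false;
    recent.push(this.now());
    return true;
  }

  // Remaining capacity and time until the oldest tracked request expires.
  status(category) {
    const { max, windowMs } = this.limits[category];
    const recent = this.prune(category);
    const resetInMs = recent.length ? recent[0] + windowMs - this.now() : 0;
    return { remaining: Math.max(0, max - recent.length), resetInMs };
  }
}

// Limits taken from the configuration listed above; key names are assumed.
const limits = {
  'ai-operations': { max: 3, windowMs: 60000 },
  'error-scanning': { max: 1, windowMs: 30000 },
};

const limiter = new SlidingWindowLimiter(limits);
console.log(limiter.tryAcquire('error-scanning')); // true: first scan in the window
console.log(limiter.tryAcquire('error-scanning')); // false: second scan within 30 s
```

Pruning on every call doubles as the stale-data cleanup, so the sketch needs no background timer.
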
### 2. Protected LLM Service (`src/lib/protected-llm-service.ts`)
- **Error tracking**: Monitors consecutive failures
- **Circuit breaker**: Pauses all requests after 5 consecutive errors
- **User-friendly error messages**: Converts technical errors to actionable messages
- **Automatic recovery**: Error count decreases on successful calls
- **Request categorization**: Groups related operations for better rate limiting
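
The error tracking and circuit breaker described above might look roughly like this. It is a sketch under assumptions, not the contents of `src/lib/protected-llm-service.ts`: the class name `ErrorBreaker` and its methods are hypothetical. Because a success lowers the count by one instead of zeroing it, "consecutive" is only approximated. The source also does not say how the count drops while every request is paused, so this sketch leaves a paused breaker closed until `reset()` is called.

```javascript
// Hypothetical sketch of the error counter and circuit breaker.
class ErrorBreaker {
  constructor(threshold = 5) {
    this.threshold = threshold; // errors needed to pause (5 per the text above)
    this.errorCount = 0;
    this.totalCalls = 0;
  }

  get isPaused() {
    return this.errorCount >= this.threshold;
  }

  recordSuccess() {
    // A success lowers the count by one; it never goes below zero.
    this.errorCount = Math.max(0, this.errorCount - 1);
  }

  recordFailure() {
    this.errorCount += 1;
  }

  // Wrap any async LLM call; `call` is whatever function performs the request.
  async run(call) {
    if (this.isPaused) {
      throw new Error('AI service temporarily unavailable due to repeated errors');
    }
    this.totalCalls += 1;
    try {
      const result = await call();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }

  getStats() {
    return {
      totalCalls: this.totalCalls,
      errorCount: this.errorCount,
      isPaused: this.isPaused,
    };
  }

  // Manual escape hatch: clears the error count so calls can flow again.
  reset() {
    this.errorCount = 0;
  }
}
```
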
### 3. Disabled Automatic Scanning
- **Removed automatic useEffect trigger** in `useAutoRepair`
- **Manual scanning only**: Users must explicitly click "Scan" button
- **Rate-limited when triggered**: Even manual scans respect rate limits
### 4. Updated All AI Services
- **ai-service.ts**: All methods now use `ProtectedLLMService`
- **error-repair-service.ts**: Code repair uses rate limiting
- **Consistent error handling**: All services handle 502/429 errors gracefully
## Benefits
1. **No more cascading failures**: Rate limiting prevents overwhelming the gateway
2. **Better user experience**: Clear error messages explain what went wrong
3. **Automatic recovery**: Circuit breaker allows system to recover from issues
4. **Resource efficiency**: Prevents wasted requests that would fail anyway
5. **Predictable behavior**: Users understand when operations might be delayed
## How It Works Now
### Normal Operation
1. User triggers an AI feature (generate component, improve code, etc.)
2. Request goes through `ProtectedLLMService`
3. Rate limiter checks if request is allowed
4. If allowed, request proceeds
5. If rate-limited, user sees friendly message about slowing down
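
The gate that a request passes through before reaching the LLM can be sketched as a single function. Everything here is illustrative: `guardRequest`, the `limiter` and `breaker` parameters, and the wording of the rate-limit message are assumptions, as is the choice to check the paused state before the rate limit.

```javascript
// Hypothetical gate; `limiter` and `breaker` stand in for the real services.
// limiter must expose tryAcquire(category); breaker must expose isPaused.
function guardRequest(category, limiter, breaker) {
  if (breaker.isPaused) {
    return {
      allowed: false,
      message: 'AI service temporarily unavailable due to repeated errors',
    };
  }
  if (!limiter.tryAcquire(category)) {
    return {
      allowed: false,
      message: 'Too many requests. Please slow down and try again shortly.',
    };
  }
  return { allowed: true };
}

// Tiny fakes, for illustration only.
const alwaysOpen = { isPaused: false };
const oneShotLimiter = {
  used: false,
  tryAcquire() {
    if (this.used) return false;
    this.used = true;
    return true;
  },
};

console.log(guardRequest('ai-operations', oneShotLimiter, alwaysOpen).allowed); // true
console.log(guardRequest('ai-operations', oneShotLimiter, alwaysOpen).allowed); // false
```

Returning a result object instead of throwing lets the UI show the friendly message without a try/catch around every AI feature.
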
### Error Handling
1. If LLM call fails with 502/Bad Gateway:
   - User sees: "Service temporarily unavailable - please wait a moment"
   - Error count increases
   - Request is blocked by rate limiter for the category
2. If too many consecutive errors (5+):
   - Circuit breaker trips
   - All AI operations pause
   - User sees: "AI service temporarily unavailable due to repeated errors"
3. Recovery:
   - Successful requests decrease error count
   - After error count drops, circuit breaker resets
   - Normal operation resumes
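
Turning a raw failure into the user-facing text could be done with a small mapping function like the one below. This is a sketch: the function name `toUserMessage`, the assumed error shape (`status` and `message` fields), and the 429 and fallback messages are illustrative. Only the 502 message is taken from the text above.

```javascript
// Hypothetical mapping from a failed LLM call to a user-facing message.
// Assumes errors may carry a numeric `status` and/or a text `message`.
function toUserMessage(err) {
  const status = err && err.status;
  const text = String((err && err.message) || '');

  if (status === 502 || /bad gateway|\b502\b/i.test(text)) {
    return 'Service temporarily unavailable - please wait a moment';
  }
  if (status === 429 || /\b429\b|rate limit/i.test(text)) {
    return 'Too many requests. Please slow down and try again shortly.';
  }
  return 'Something went wrong with the AI request. Please try again.';
}

console.log(toUserMessage({ status: 502 }));
console.log(toUserMessage(new Error('Rate limit exceeded for llm-category')));
```
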
### Manual Controls
Users can check AI service status:
```javascript
const stats = ProtectedLLMService.getStats()
// Returns: { totalCalls, errorCount, isPaused }
```
Users can manually reset if needed:
```javascript
ProtectedLLMService.reset()
// Clears all rate limits and error counts
```
## Testing the Fix
1. **Verify no automatic scanning**: Open the app - no LLM calls should fire automatically
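2. **Test rate limiting**: Try generating 5 components quickly - with the limit of 3 AI requests per minute, the fourth should show the rate limit message
3. **Test error recovery**: After a failed call, confirm that a later successful call goes through and lowers the error count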
4. **Check manual scan**: Error panel scan button should work with rate limiting
## Monitoring
Watch the browser console for:
- `LLM call failed (category): error` - Individual failures
- `Rate limit exceeded for llm-category` - Rate limiting in action
- `Too many LLM errors detected` - Circuit breaker activation
## Future Improvements
1. **Retry queue**: Queue rate-limited requests and auto-retry
2. **Progressive backoff**: Increase delays after repeated failures
3. **Request deduplication**: Prevent identical simultaneous requests
4. **Usage analytics**: Track which features use most AI calls
5. **User quotas**: Per-user rate limiting for multi-tenant deployments
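
As an illustration of the progressive backoff item, one common shape is exponential backoff with a cap. This is only a sketch of the idea, not planned code: the helper name `backoffDelayMs` and the 1-second base and 30-second cap are assumptions.

```javascript
// Hypothetical helper for progressive backoff; values are illustrative.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000) {
  // attempt 0 -> base, 1 -> 2x base, 2 -> 4x base, ... capped at capMs
  return Math.min(capMs, baseMs * 2 ** attempt);
}

console.log([0, 1, 2, 3, 4, 5].map((n) => backoffDelayMs(n)));
// [ 1000, 2000, 4000, 8000, 16000, 30000 ]
```
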
## Files Modified
- `/src/lib/rate-limiter.ts` (NEW)
- `/src/lib/protected-llm-service.ts` (NEW)
- `/src/lib/ai-service.ts` (UPDATED - now uses rate limiting)
- `/src/lib/error-repair-service.ts` (UPDATED - now uses rate limiting)
- `/src/hooks/use-auto-repair.ts` (UPDATED - disabled automatic scanning)