Key Takeaway
Every AI system should have a defined graceful degradation path that provides reduced but functional service when the primary model or provider is unavailable. This guide covers five AI-specific disaster scenarios with RTO/RPO targets, failover procedures, and recovery runbooks for each.
Prerequisites
- An existing business continuity or disaster recovery framework
- Inventory of all AI systems with their dependencies (providers, models, data stores, infrastructure)
- Model versioning and artifact storage with backup capabilities
- Understanding of which AI features are business-critical vs. best-effort
- On-call procedures and incident management infrastructure
AI Disaster Scenarios Are Different
Traditional disaster recovery plans assume that recovery means restoring the same application to the same state. AI disaster recovery adds scenarios that have no traditional equivalent: your model provider goes down (your code is fine but the model is unreachable), your model is corrupted (the application is running but producing wrong results), your training data is lost (the running model still works but you cannot retrain it), or GPU availability evaporates (training and scaling become impossible even though current serving continues). Each scenario requires different recovery strategies and different RTO/RPO targets.
RTO/RPO Framework for AI Systems
RTO (Recovery Time Objective) and RPO (Recovery Point Objective) must be defined separately for model serving and model training. Serving RTO is how long you can be without any AI inference capability. Training RPO is how much training data and progress you can afford to lose. These targets vary by system criticality: a customer-facing recommendation engine has different RTO requirements than an internal analytics model.
| Scenario | Serving RTO | Training RPO | Primary Strategy | Fallback Strategy |
|---|---|---|---|---|
| LLM Provider Outage | < 5 minutes | N/A (no training loss) | Automatic failover to secondary provider | Serve cached responses + graceful degradation |
| Model Corruption | < 15 minutes | Last known good model version | Automated rollback to previous model version | Emergency retraining from checkpoint |
| Training Data Loss | No serving impact | < 24 hours of training data | Restore from backup storage | Rebuild from source systems with lineage records |
| Infrastructure Failure | < 30 minutes | Last checkpoint | Multi-region failover | Cloud provider migration to backup region |
| Cascading Failure | < 5 minutes (per component) | N/A | Circuit breakers + dependency isolation | Load shedding: disable non-critical AI features |
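The targets in the table above can be encoded as machine-readable recovery objectives so monitoring can flag an RTO breach automatically rather than relying on someone remembering the numbers mid-incident. A minimal sketch in TypeScript; all field names, scenario identifiers, and strategy labels are illustrative assumptions:

```typescript
// Hypothetical machine-readable encoding of the RTO/RPO table.
// A null servingRtoMinutes means the scenario has no serving
// impact; a null trainingRpoHours means no training loss.
interface RecoveryTargets {
  scenario: string;
  servingRtoMinutes: number | null;
  trainingRpoHours: number | null;
  primaryStrategy: string;
  fallbackStrategy: string;
}

const drTargets: RecoveryTargets[] = [
  {
    scenario: "llm-provider-outage",
    servingRtoMinutes: 5,
    trainingRpoHours: null,
    primaryStrategy: "failover-to-secondary-provider",
    fallbackStrategy: "cached-responses-plus-degradation",
  },
  {
    scenario: "training-data-loss",
    servingRtoMinutes: null, // no serving impact
    trainingRpoHours: 24,
    primaryStrategy: "restore-from-backup",
    fallbackStrategy: "rebuild-from-source-with-lineage",
  },
];

// Flag an RTO breach so an incident can be escalated automatically.
function breachesServingRto(
  target: RecoveryTargets,
  downtimeMinutes: number,
): boolean {
  return (
    target.servingRtoMinutes !== null
    && downtimeMinutes > target.servingRtoMinutes
  );
}
```

Encoding the targets this way also makes them reviewable in version control alongside the rest of the DR plan.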
Multi-Provider Failover
For organizations using third-party LLM APIs, a provider outage is the most likely disaster scenario; every major LLM provider has experienced multi-hour outages. Multi-provider failover maintains integrations with at least two LLM providers and automatically routes traffic to the secondary provider when the primary is unavailable. The key challenge is maintaining prompt and evaluation parity across providers, since the same prompt may produce different-quality results on different models.
/**
 * Multi-provider LLM failover.
 *
 * Automatically routes requests to a backup LLM provider
 * when the primary provider is unavailable or degraded.
 */
interface ProviderConfig {
  name: string;
  model: string;
  endpoint: string;
  apiKey: string;
  priority: number; // Lower = higher priority
  healthCheckUrl: string;
  maxLatencyMs: number;
}

interface FailoverState {
  currentProvider: string;
  failoverActive: boolean;
  failoverStartedAt: number | null;
  consecutiveFailures: Map<string, number>;
}

class LLMFailoverRouter {
  private state: FailoverState;
  private readonly failureThreshold = 3;
  private readonly recoveryCheckIntervalMs = 30000; // 30 sec

  constructor(
    private readonly providers: ProviderConfig[],
  ) {
    this.providers.sort((a, b) => a.priority - b.priority);
    this.state = {
      currentProvider: this.providers[0].name,
      failoverActive: false,
      failoverStartedAt: null,
      consecutiveFailures: new Map(),
    };
  }

  async route(
    request: { prompt: string; maxTokens: number },
  ): Promise<{ response: string; provider: string }> {
    // Try providers in priority order
    for (const provider of this.providers) {
      try {
        const response = await this.callProvider(
          provider, request,
        );
        // Reset failure count on success
        this.state.consecutiveFailures.set(provider.name, 0);
        // Track failover state so operators can see when the
        // system is running on a non-primary provider
        if (provider.name !== this.providers[0].name) {
          if (!this.state.failoverActive) {
            this.state.failoverActive = true;
            this.state.failoverStartedAt = Date.now();
          }
        } else {
          this.state.failoverActive = false;
          this.state.failoverStartedAt = null;
        }
        this.state.currentProvider = provider.name;
        return { response, provider: provider.name };
      } catch {
        this.recordFailure(provider.name);
        // Try next provider
      }
    }
    // All providers failed -- signal the caller to activate
    // cached response serving (Tier 2 degradation)
    throw new Error(
      "All LLM providers unavailable. "
      + "Activate cached response serving.",
    );
  }

  private recordFailure(providerName: string): void {
    const failures =
      (this.state.consecutiveFailures.get(providerName) ?? 0) + 1;
    this.state.consecutiveFailures.set(providerName, failures);
    if (failures >= this.failureThreshold) {
      console.warn(
        `Provider ${providerName} marked unhealthy after `
        + `${failures} consecutive failures`,
      );
    }
  }

  private async callProvider(
    provider: ProviderConfig,
    request: { prompt: string; maxTokens: number },
  ): Promise<string> {
    // Skip providers already marked unhealthy
    const failures =
      this.state.consecutiveFailures.get(provider.name) ?? 0;
    if (failures >= this.failureThreshold) {
      throw new Error(
        `Provider ${provider.name} is unhealthy`,
      );
    }
    // Placeholder for the provider-specific API call
    const response = await fetch(provider.endpoint, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${provider.apiKey}`,
      },
      body: JSON.stringify(request),
      signal: AbortSignal.timeout(provider.maxLatencyMs),
    });
    if (!response.ok) {
      throw new Error(`Provider returned ${response.status}`);
    }
    const data = await response.json();
    const content =
      data.content ?? data.choices?.[0]?.message?.content;
    if (typeof content !== "string") {
      throw new Error("Provider response contained no text");
    }
    return content;
  }
}
Graceful Degradation Patterns
Graceful degradation defines what the user experience looks like when AI capabilities are reduced or unavailable. Rather than showing an error page, the application provides reduced functionality using non-AI fallbacks. A recommendation system falls back to popularity-based recommendations. A search system falls back to keyword matching. A content generation system shows a message explaining that the feature is temporarily unavailable. Every AI feature should have a defined degradation path documented before it launches.
Tier 1: Model Fallback
Switch to a simpler, locally hosted model that provides lower quality but maintains the AI-powered experience. Use a fine-tuned smaller model or a rule-based system. This is the preferred degradation tier because users still get an AI experience.
Tier 2: Cached Response Serving
Serve cached responses for common queries. This works well for applications with repetitive query patterns. Combine it with a "results may be from cache" indicator so users understand that responses may not reflect the latest data.
Tier 3: Non-AI Fallback
Fall back to a non-AI implementation: keyword search instead of semantic search, rule-based recommendations instead of ML recommendations, manual processes instead of automated classification. This tier preserves functionality but loses the AI quality advantage.
Tier 4: Feature Disabled
Disable the AI feature entirely and show a user-friendly message. This is the last resort for features where no non-AI fallback exists or where a poor fallback would be worse than no feature at all.
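The four tiers above can be wired into a single selection function so the degradation path is explicit in code rather than decided ad hoc during an incident. A minimal sketch; the health signals are assumed to come from your own monitoring, and all names are illustrative:

```typescript
// Degradation tiers, from best experience to last resort.
type DegradationTier =
  | "primary"          // full AI feature
  | "model-fallback"   // Tier 1: simpler local model
  | "cached"           // Tier 2: cached responses
  | "non-ai-fallback"  // Tier 3: rule-based / keyword logic
  | "disabled";        // Tier 4: feature off with message

interface HealthSignals {
  primaryHealthy: boolean;
  fallbackModelHealthy: boolean;
  cacheHit: boolean;
  nonAiFallbackExists: boolean;
}

// Walk the tiers in order and return the first viable one.
function selectTier(signals: HealthSignals): DegradationTier {
  if (signals.primaryHealthy) return "primary";
  if (signals.fallbackModelHealthy) return "model-fallback";
  if (signals.cacheHit) return "cached";
  if (signals.nonAiFallbackExists) return "non-ai-fallback";
  return "disabled";
}
```

Keeping the tier order in one function also gives you a single place to log which tier served each request, which is useful when reviewing an incident afterward.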
Model Version Management
Model version management is the foundation of AI disaster recovery. Without the ability to quickly roll back to a previous model version, any model corruption or quality degradation becomes a protracted incident. Your model registry must maintain at least the last three production model versions in a deployment-ready state, with the ability to promote any version to production within minutes. Treat model versions like database backups: test your restoration process regularly.
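A minimal sketch of the rollback operation such a registry needs; the interface, statuses, and URIs are illustrative assumptions, not a specific registry's API:

```typescript
// Hypothetical registry entry for one model version.
interface ModelVersion {
  version: string;
  artifactUri: string;
  status: "production" | "ready" | "archived";
}

class ModelRegistry {
  constructor(private readonly versions: ModelVersion[]) {}

  current(): ModelVersion | undefined {
    return this.versions.find((v) => v.status === "production");
  }

  // Promote a deployment-ready version and demote the current
  // production version. A real implementation would also swap
  // serving traffic and write an audit record.
  rollbackTo(version: string): ModelVersion {
    const target = this.versions.find(
      (v) => v.version === version && v.status === "ready",
    );
    if (!target) {
      throw new Error(
        `Version ${version} is not deployment-ready`,
      );
    }
    const cur = this.current();
    if (cur) cur.status = "ready";
    target.status = "production";
    return target;
  }
}
```

Note that rollback only succeeds for versions kept in the "ready" state, which is exactly why the last three production versions must stay deployment-ready rather than archived.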
Test your disaster recovery procedures regularly. A multi-provider failover that has never been tested in production will almost certainly not work correctly when you need it. Schedule quarterly DR drills that simulate each failure scenario. Time the recovery and identify bottlenecks. Update procedures based on what you learn.
Failover Infrastructure
Data and Model Recovery
Version History
1.0.0 · 2026-03-01
- Initial release with five AI disaster scenarios and RTO/RPO framework
- Multi-provider LLM failover implementation in TypeScript
- Four-tier graceful degradation pattern for AI features
- Model version management and rollback guidance
- DR infrastructure and recovery readiness checklists