Key Takeaway
Every AI system should have a defined graceful degradation path that provides reduced but functional service when the primary model or provider is unavailable. This guide covers five AI-specific disaster scenarios with RTO/RPO targets, failover procedures, and recovery runbooks for each.
Prerequisites
- An existing business continuity or disaster recovery framework
- Inventory of all AI systems with their dependencies (providers, models, data stores, infrastructure)
- Model versioning and artifact storage with backup capabilities
- Understanding of which AI features are business-critical vs. best-effort
- On-call procedures and incident management infrastructure
AI Disaster Scenarios Are Different
Traditional disaster recovery plans assume that recovery means restoring the same application to the same state. AI disaster recovery adds scenarios that have no traditional equivalent: your model provider goes down (your code is fine but the model is unreachable), your model is corrupted (the application is running but producing wrong results), your training data is lost (the running model still works but you cannot retrain it), or GPU availability evaporates (training and scaling become impossible even though current serving continues). Each scenario requires different recovery strategies and different RTO/RPO targets.
RTO/RPO Framework for AI Systems
RTO (Recovery Time Objective) and RPO (Recovery Point Objective) must be defined separately for model serving and model training. Serving RTO is how long you can be without any AI inference capability. Training RPO is how much training data and progress you can afford to lose. These targets vary by system criticality: a customer-facing recommendation engine has different RTO requirements than an internal analytics model.
| Scenario | Serving RTO | Training RPO | Primary Strategy | Fallback Strategy |
|---|---|---|---|---|
| LLM Provider Outage | < 5 minutes | N/A (no training loss) | Automatic failover to secondary provider | Serve cached responses + graceful degradation |
| Model Corruption | < 15 minutes | Last known good model version | Automated rollback to previous model version | Emergency retraining from checkpoint |
| Training Data Loss | No serving impact | < 24 hours of training data | Restore from backup storage | Rebuild from source systems with lineage records |
| Infrastructure Failure | < 30 minutes | Last checkpoint | Multi-region failover | Cloud provider migration to backup region |
| Cascading Failure | < 5 minutes (per component) | N/A | Circuit breakers + dependency isolation | Load shedding: disable non-critical AI features |
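The targets in the table above can be encoded as machine-readable recovery objectives so monitoring can flag an RTO breach automatically rather than relying on someone remembering the numbers mid-incident. A minimal sketch in TypeScript; all field names, scenario identifiers, and strategy labels are illustrative assumptions:

```typescript
// Hypothetical machine-readable encoding of the RTO/RPO table.
// A null servingRtoMinutes means the scenario has no serving
// impact; a null trainingRpoHours means no training loss.
interface RecoveryTargets {
  scenario: string;
  servingRtoMinutes: number | null;
  trainingRpoHours: number | null;
  primaryStrategy: string;
  fallbackStrategy: string;
}

const drTargets: RecoveryTargets[] = [
  {
    scenario: "llm-provider-outage",
    servingRtoMinutes: 5,
    trainingRpoHours: null,
    primaryStrategy: "failover-to-secondary-provider",
    fallbackStrategy: "cached-responses-plus-degradation",
  },
  {
    scenario: "training-data-loss",
    servingRtoMinutes: null, // no serving impact
    trainingRpoHours: 24,
    primaryStrategy: "restore-from-backup",
    fallbackStrategy: "rebuild-from-source-with-lineage",
  },
];

// Flag an RTO breach so an incident can be escalated automatically.
function breachesServingRto(
  target: RecoveryTargets,
  downtimeMinutes: number,
): boolean {
  return (
    target.servingRtoMinutes !== null
    && downtimeMinutes > target.servingRtoMinutes
  );
}
```

Encoding the targets this way also makes them reviewable in version control alongside the rest of the DR plan.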
Multi-Provider Failover
For organizations using third-party LLM APIs, a provider outage is the most likely disaster scenario; every major LLM provider has experienced multi-hour outages. Multi-provider failover maintains integrations with at least two LLM providers and automatically routes traffic to the secondary provider when the primary is unavailable. The key challenge is maintaining prompt and evaluation parity across providers, since the same prompt may produce different-quality results on different models.
/**
 * Multi-provider LLM failover.
 *
 * Automatically routes requests to a backup LLM provider
 * when the primary provider is unavailable or degraded.
 */
interface ProviderConfig {
  name: string;
  model: string;
  endpoint: string;
  apiKey: string;
  priority: number; // Lower = higher priority
  healthCheckUrl: string;
  maxLatencyMs: number;
}

interface FailoverState {
  currentProvider: string;
  failoverActive: boolean;
  failoverStartedAt: number | null;
  consecutiveFailures: Map<string, number>;
}

class LLMFailoverRouter {
  private state: FailoverState;
  private readonly failureThreshold = 3;
  private readonly recoveryCheckIntervalMs = 30000; // 30 sec

  constructor(
    private readonly providers: ProviderConfig[],
  ) {
    this.providers.sort((a, b) => a.priority - b.priority);
    this.state = {
      currentProvider: this.providers[0].name,
      failoverActive: false,
      failoverStartedAt: null,
      consecutiveFailures: new Map(),
    };
  }

  async route(
    request: { prompt: string; maxTokens: number },
  ): Promise<{ response: string; provider: string }> {
    // Try providers in priority order
    for (const provider of this.providers) {
      try {
        const response = await this.callProvider(
          provider, request,
        );
        // Reset failure count on success
        this.state.consecutiveFailures.set(provider.name, 0);
        // Track failover state so operators can see when the
        // system is running on a non-primary provider
        if (provider.name !== this.providers[0].name) {
          if (!this.state.failoverActive) {
            this.state.failoverActive = true;
            this.state.failoverStartedAt = Date.now();
          }
        } else {
          this.state.failoverActive = false;
          this.state.failoverStartedAt = null;
        }
        this.state.currentProvider = provider.name;
        return { response, provider: provider.name };
      } catch {
        this.recordFailure(provider.name);
        // Try next provider
      }
    }
    // All providers failed -- signal the caller to activate
    // cached response serving (Tier 2 degradation)
    throw new Error(
      "All LLM providers unavailable. "
      + "Activate cached response serving.",
    );
  }

  private recordFailure(providerName: string): void {
    const failures =
      (this.state.consecutiveFailures.get(providerName) ?? 0) + 1;
    this.state.consecutiveFailures.set(providerName, failures);
    if (failures >= this.failureThreshold) {
      console.warn(
        `Provider ${providerName} marked unhealthy after `
        + `${failures} consecutive failures`,
      );
    }
  }

  private async callProvider(
    provider: ProviderConfig,
    request: { prompt: string; maxTokens: number },
  ): Promise<string> {
    // Skip providers already marked unhealthy
    const failures =
      this.state.consecutiveFailures.get(provider.name) ?? 0;
    if (failures >= this.failureThreshold) {
      throw new Error(
        `Provider ${provider.name} is unhealthy`,
      );
    }
    // Placeholder for the provider-specific API call
    const response = await fetch(provider.endpoint, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${provider.apiKey}`,
      },
      body: JSON.stringify(request),
      signal: AbortSignal.timeout(provider.maxLatencyMs),
    });
    if (!response.ok) {
      throw new Error(`Provider returned ${response.status}`);
    }
    const data = await response.json();
    const content =
      data.content ?? data.choices?.[0]?.message?.content;
    if (typeof content !== "string") {
      throw new Error("Provider response contained no text");
    }
    return content;
  }
}
Graceful Degradation Patterns
Graceful degradation defines what the user experience looks like when AI capabilities are reduced or unavailable. Rather than showing an error page, the application provides reduced functionality using non-AI fallbacks. A recommendation system falls back to popularity-based recommendations. A search system falls back to keyword matching. A content generation system shows a message explaining that the feature is temporarily unavailable. Every AI feature should have a defined degradation path documented before it launches.
Tier 1: Model Fallback
Switch to a simpler, locally hosted model that provides lower quality but maintains the AI-powered experience. Use a fine-tuned smaller model or a rule-based system. This is the preferred degradation tier because users still get an AI experience.
Tier 2: Cached Response Serving
Serve cached responses for common queries. This works well for applications with repetitive query patterns. Combine it with a "results may be from cache" indicator so users understand that responses may not reflect the latest data.
Tier 3: Non-AI Fallback
Fall back to a non-AI implementation: keyword search instead of semantic search, rule-based recommendations instead of ML recommendations, manual processes instead of automated classification. This tier preserves functionality but loses the AI quality advantage.
Tier 4: Feature Disabled
Disable the AI feature entirely and show a user-friendly message. This is the last resort for features where no non-AI fallback exists or where a poor fallback would be worse than no feature at all.
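The four tiers above can be wired into a single selection function so the degradation path is explicit in code rather than decided ad hoc during an incident. A minimal sketch; the health signals are assumed to come from your own monitoring, and all names are illustrative:

```typescript
// Degradation tiers, from best experience to last resort.
type DegradationTier =
  | "primary"          // full AI feature
  | "model-fallback"   // Tier 1: simpler local model
  | "cached"           // Tier 2: cached responses
  | "non-ai-fallback"  // Tier 3: rule-based / keyword logic
  | "disabled";        // Tier 4: feature off with message

interface HealthSignals {
  primaryHealthy: boolean;
  fallbackModelHealthy: boolean;
  cacheHit: boolean;
  nonAiFallbackExists: boolean;
}

// Walk the tiers in order and return the first viable one.
function selectTier(signals: HealthSignals): DegradationTier {
  if (signals.primaryHealthy) return "primary";
  if (signals.fallbackModelHealthy) return "model-fallback";
  if (signals.cacheHit) return "cached";
  if (signals.nonAiFallbackExists) return "non-ai-fallback";
  return "disabled";
}
```

Keeping the tier order in one function also gives you a single place to log which tier served each request, which is useful when reviewing an incident afterward.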
Model Version Management
Model version management is the foundation of AI disaster recovery. Without the ability to quickly roll back to a previous model version, any model corruption or quality degradation becomes a protracted incident. Your model registry must maintain at least the last three production model versions in a deployment-ready state, with the ability to promote any version to production within minutes. Treat model versions like database backups: test your restoration process regularly.
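A minimal sketch of the rollback operation such a registry needs; the interface, statuses, and URIs are illustrative assumptions, not a specific registry's API:

```typescript
// Hypothetical registry entry for one model version.
interface ModelVersion {
  version: string;
  artifactUri: string;
  status: "production" | "ready" | "archived";
}

class ModelRegistry {
  constructor(private readonly versions: ModelVersion[]) {}

  current(): ModelVersion | undefined {
    return this.versions.find((v) => v.status === "production");
  }

  // Promote a deployment-ready version and demote the current
  // production version. A real implementation would also swap
  // serving traffic and write an audit record.
  rollbackTo(version: string): ModelVersion {
    const target = this.versions.find(
      (v) => v.version === version && v.status === "ready",
    );
    if (!target) {
      throw new Error(
        `Version ${version} is not deployment-ready`,
      );
    }
    const cur = this.current();
    if (cur) cur.status = "ready";
    target.status = "production";
    return target;
  }
}
```

Note that rollback only succeeds for versions kept in the "ready" state, which is exactly why the last three production versions must stay deployment-ready rather than archived.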
Test your disaster recovery procedures regularly. A multi-provider failover that has never been tested in production will almost certainly not work correctly when you need it. Schedule quarterly DR drills that simulate each failure scenario. Time the recovery and identify bottlenecks. Update procedures based on what you learn.
Failover Infrastructure
Data and Model Recovery
Version History
1.0.0 · 2026-03-01
- Initial release with five AI disaster scenarios and RTO/RPO framework
- Multi-provider LLM failover implementation in TypeScript
- Four-tier graceful degradation pattern for AI features
- Model version management and rollback guidance
- DR infrastructure and recovery readiness checklists