Key Takeaway
Effective model monitoring combines statistical drift detection with business metric tracking, because data drift only matters when it impacts the outcomes your stakeholders care about. This playbook covers four monitoring layers with specific metrics, alert thresholds, detection methods, and automated response actions for each layer.
Prerequisites
- At least one ML model serving production traffic with logged predictions
- An observability stack (Prometheus/Grafana, Datadog, or equivalent) for metrics collection
- Access to ground truth labels or a proxy for model accuracy measurement
- A reference dataset representing the expected input distribution (typically the test or validation set)
- Basic understanding of statistical tests (KS test, PSI) and drift detection concepts
The Four Monitoring Layers
Model monitoring operates at four layers, each answering a different question:

- Data quality monitoring: is the input data well-formed and within expected bounds?
- Feature drift monitoring: has the statistical distribution of inputs changed since training?
- Model performance monitoring: is the model still producing accurate predictions?
- Business impact monitoring: are the model's predictions driving the business outcomes we expect?

Each layer catches different failure modes, and no single layer is sufficient on its own.
Layer 1: Data Quality Monitoring
Data quality monitoring is the first line of defense. It catches issues before they reach the model: schema violations (unexpected types, missing required fields), value range violations (negative ages, future dates, out-of-vocabulary categories), null rate spikes (a feature that is suddenly missing for a large percentage of requests), and volume anomalies (traffic significantly above or below expected levels). These checks should run on every incoming request or batch, with alerting thresholds calibrated to your traffic patterns.
"""Data quality monitoring for model inputs.
Validates incoming data against expected schemas
and distributions, catching upstream pipeline issues
before they corrupt model predictions.
"""
from dataclasses import dataclass
from typing import Dict, List, Optional, Any
import numpy as np
@dataclass
class QualityCheckResult:
"""Result of a single data quality check."""
check_name: str
passed: bool
metric_value: float
threshold: float
details: str
class DataQualityMonitor:
"""Monitor incoming model inputs for quality issues."""
def __init__(
self,
feature_schemas: Dict[str, Dict[str, Any]],
null_rate_threshold: float = 0.05,
volume_deviation_threshold: float = 0.5,
):
self.schemas = feature_schemas
self.null_threshold = null_rate_threshold
self.volume_threshold = volume_deviation_threshold
self._baseline_volume: Optional[float] = None
def check_null_rates(
self, batch: Dict[str, List],
) -> List[QualityCheckResult]:
"""Check null rates for each feature in a batch."""
results = []
for feature, values in batch.items():
null_count = sum(1 for v in values if v is None)
null_rate = null_count / len(values) if values else 0
results.append(QualityCheckResult(
check_name=f"null_rate_{feature}",
passed=null_rate <= self.null_threshold,
metric_value=null_rate,
threshold=self.null_threshold,
details=(
f"{feature}: {null_rate:.2%} null "
f"({null_count}/{len(values)})"
),
))
return results
def check_value_ranges(
self, batch: Dict[str, List],
) -> List[QualityCheckResult]:
"""Validate feature values against defined ranges."""
results = []
for feature, values in batch.items():
schema = self.schemas.get(feature, {})
min_val = schema.get("min")
max_val = schema.get("max")
if min_val is None and max_val is None:
continue
non_null = [v for v in values if v is not None]
if not non_null:
continue
violations = sum(
1 for v in non_null
if (min_val is not None and v < min_val)
or (max_val is not None and v > max_val)
)
violation_rate = violations / len(non_null)
results.append(QualityCheckResult(
check_name=f"range_{feature}",
passed=violation_rate <= 0.01,
metric_value=violation_rate,
threshold=0.01,
details=(
f"{feature}: {violations} values "
f"outside [{min_val}, {max_val}]"
),
))
return resultsLayer 2: Feature Drift Detection
Feature drift occurs when the statistical distribution of input features shifts relative to the training data. This is the most common cause of gradual model degradation: the model was trained on data from one distribution, and over time the real world changes, causing the input distribution to diverge. Detecting drift requires comparing the current input distribution against a reference distribution (typically the training or validation set) using statistical tests.
The two most commonly used drift detection methods are the Population Stability Index (PSI) and the Kolmogorov-Smirnov (KS) test. PSI is preferred for production monitoring because it is intuitive (values below 0.1 indicate no significant drift, 0.1-0.2 indicates moderate drift, above 0.2 indicates significant drift) and works well for both numerical and categorical features. The KS test provides a formal statistical test with a p-value, which is useful for rigorous analysis but can be oversensitive on large sample sizes.
"""Feature drift detection using PSI and KS tests.
Compare current production data distributions against
reference distributions to detect feature drift.
"""
import numpy as np
from scipy import stats
from typing import Dict, Tuple
def population_stability_index(
reference: np.ndarray,
current: np.ndarray,
bins: int = 10,
) -> float:
"""Calculate Population Stability Index (PSI).
PSI interpretation:
< 0.1: No significant drift
0.1-0.2: Moderate drift, investigate
> 0.2: Significant drift, action needed
Args:
reference: Reference distribution (training data)
current: Current production distribution
bins: Number of bins for histogram comparison
"""
# Create bins from the reference distribution
breakpoints = np.percentile(
reference, np.linspace(0, 100, bins + 1)
)
breakpoints = np.unique(breakpoints)
# Calculate proportions in each bin
ref_counts = np.histogram(reference, bins=breakpoints)[0]
cur_counts = np.histogram(current, bins=breakpoints)[0]
# Normalize to proportions (avoid division by zero)
ref_props = (ref_counts + 1e-6) / ref_counts.sum()
cur_props = (cur_counts + 1e-6) / cur_counts.sum()
# PSI formula
psi = np.sum(
(cur_props - ref_props) * np.log(cur_props / ref_props)
)
return float(psi)
def detect_drift(
reference_data: Dict[str, np.ndarray],
current_data: Dict[str, np.ndarray],
psi_threshold: float = 0.2,
) -> Dict[str, Dict]:
"""Run drift detection across all features.
Returns a dict mapping feature names to drift results.
"""
results = {}
for feature in reference_data:
if feature not in current_data:
continue
ref = reference_data[feature]
cur = current_data[feature]
psi = population_stability_index(ref, cur)
ks_stat, ks_pvalue = stats.ks_2samp(ref, cur)
results[feature] = {
"psi": round(psi, 4),
"ks_statistic": round(ks_stat, 4),
"ks_pvalue": round(ks_pvalue, 4),
"drifted": psi > psi_threshold,
"severity": (
"none" if psi < 0.1
else "moderate" if psi < 0.2
else "significant"
),
}
return resultsLayer 3: Model Performance Monitoring
Performance monitoring tracks whether the model's predictions are still accurate. This is the most important monitoring layer but also the hardest to implement, because it requires ground truth labels. In many production systems, ground truth labels arrive with a delay (e.g., you know whether a loan defaulted months after the prediction was made) or are only available for a subset of predictions (e.g., only predictions that were acted upon have observable outcomes). Design your monitoring around the reality of your label availability, not around the ideal case.
When ground truth labels are delayed, use proxy metrics as leading indicators. For a recommendation system, click-through rate is an immediate proxy for recommendation quality. For a fraud detection system, customer dispute rate is a lagging but reliable quality signal. Define both leading proxies and lagging ground truth metrics for every monitored model.
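The pairing of a fast proxy with delayed ground truth can be sketched as a small tracker that maintains rolling windows for both signals and reports each one's relative drop from its baseline. This is an illustrative sketch, not a library API: the class and method names (`PerformanceTracker`, `record_proxy`, `record_ground_truth`) and the baselines are invented for this example.

```python
"""Illustrative tracker pairing a leading proxy metric (e.g. CTR)
with lagging ground-truth accuracy. All names are examples."""
from collections import deque
from typing import Deque, Optional


class PerformanceTracker:
    """Track a fast proxy metric alongside delayed ground truth."""

    def __init__(self, proxy_baseline: float, accuracy_baseline: float,
                 window: int = 1000):
        self.proxy_baseline = proxy_baseline
        self.accuracy_baseline = accuracy_baseline
        self._proxy: Deque[float] = deque(maxlen=window)
        self._truth: Deque[int] = deque(maxlen=window)

    def record_proxy(self, value: float) -> None:
        """Record an immediate signal, e.g. click (1.0) or no click (0.0)."""
        self._proxy.append(value)

    def record_ground_truth(self, correct: bool) -> None:
        """Record a delayed label once it finally arrives."""
        self._truth.append(int(correct))

    def proxy_drop(self) -> Optional[float]:
        """Relative drop of the proxy metric vs. its baseline."""
        if not self._proxy:
            return None
        current = sum(self._proxy) / len(self._proxy)
        return (self.proxy_baseline - current) / self.proxy_baseline

    def accuracy_drop(self) -> Optional[float]:
        """Relative drop of ground-truth accuracy vs. its baseline."""
        if not self._truth:
            return None
        current = sum(self._truth) / len(self._truth)
        return (self.accuracy_baseline - current) / self.accuracy_baseline
```

The proxy drop is available immediately and feeds the warning thresholds; the accuracy drop lags behind it and confirms (or clears) the alert once labels arrive.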
Layer 4: Business Impact Monitoring
Business impact monitoring connects model performance to business outcomes. Feature drift and accuracy degradation are only meaningful insofar as they affect the metrics that stakeholders care about: revenue, conversion rates, customer satisfaction, operational efficiency. This layer correlates model quality metrics with business KPIs to answer the question executives ask: is the AI making us better off?
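One simple way to ground this layer, sketched below under the assumption that you can align a daily model-quality series with a daily business KPI series: compute their correlation to see whether the KPI is actually sensitive to model quality. The function name and series are invented for illustration.

```python
"""Hedged sketch: correlate a daily model-quality series with a
daily business KPI. Series names and data are illustrative."""
import numpy as np


def quality_kpi_correlation(quality: np.ndarray, kpi: np.ndarray) -> float:
    """Pearson correlation between daily model quality and a KPI.

    A strong positive correlation suggests the KPI tracks model
    quality, so drift and accuracy alerts on this model are
    business-critical; a weak correlation suggests they may not be.
    """
    if len(quality) != len(kpi) or len(quality) < 3:
        raise ValueError("need equal-length series of at least 3 points")
    return float(np.corrcoef(quality, kpi)[0, 1])
```

Correlation is not causation, but a model whose quality swings show no relationship to any KPI is a candidate for lighter-weight monitoring.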
Alert Threshold Calibration
Poor threshold calibration is the most common reason model monitoring fails in practice. Thresholds that are too tight generate alert fatigue: the on-call engineer investigates five drift alerts per day, finds that none of them actually impact model quality, and starts ignoring all drift alerts. Thresholds that are too loose miss real degradation. Calibrate thresholds using historical data: run your drift detection on past production data where you know the ground truth, and find the threshold that catches real degradation while minimizing false positives.
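The calibration loop described above can be sketched as a threshold sweep over historical episodes, where each episode pairs an observed drift score with a hindsight label of whether real degradation occurred. The function name and the data shape are our assumptions for this example.

```python
"""Sketch of alert-threshold calibration on historical data.
Each record pairs a daily PSI value with a hindsight label of
whether real degradation occurred. Data shape is illustrative."""
from typing import Dict, List, Tuple


def sweep_thresholds(
    history: List[Tuple[float, bool]],
    candidates: List[float],
) -> Dict[float, Dict[str, float]]:
    """For each candidate threshold, compute alert precision and recall."""
    results = {}
    for t in candidates:
        tp = sum(1 for psi, bad in history if psi > t and bad)
        fp = sum(1 for psi, bad in history if psi > t and not bad)
        fn = sum(1 for psi, bad in history if psi <= t and bad)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        results[t] = {"precision": round(precision, 3),
                      "recall": round(recall, 3)}
    return results
```

Pick the threshold with the highest precision among candidates that still achieve the recall you need; alert fatigue is usually a precision problem, missed degradation a recall problem.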
| Metric | Measurement | Warning Threshold | Critical Threshold | Response |
|---|---|---|---|---|
| Feature Drift (PSI) | Daily, per feature | PSI > 0.1 | PSI > 0.2 | Warning: investigate within 24h. Critical: trigger evaluation pipeline immediately. |
| Prediction Accuracy | Hourly (proxy), daily (ground truth) | 5% relative decrease from baseline | 10% relative decrease from baseline | Warning: review in next standup. Critical: initiate incident response. |
| Null Rate Spike | Per batch or per hour | > 2x baseline null rate | > 5x baseline null rate | Warning: check upstream pipeline. Critical: page data engineering. |
| Latency (p99) | Continuous | > 2x baseline p99 | > 5x baseline or > SLA | Warning: investigate load and model size. Critical: scale or rollback. |
| Confidence Distribution | Hourly | Mean confidence drops > 10% | Mean confidence drops > 25% | Warning: check for distribution shift. Critical: evaluate model quality. |
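Applying a row of the table is mechanical; as one minimal sketch, the null-rate spike row (2x baseline for warning, 5x for critical) becomes a three-way classifier. The multipliers come from the table; the function name and the zero-baseline handling are our choices.

```python
"""Minimal sketch of the null-rate spike thresholds from the
table: > 2x baseline is a warning, > 5x is critical."""


def null_rate_alert_level(current_rate: float, baseline_rate: float) -> str:
    """Classify a null-rate observation against its baseline."""
    if baseline_rate <= 0:
        # No baseline nulls recorded: any nulls at all are suspicious.
        return "critical" if current_rate > 0 else "ok"
    ratio = current_rate / baseline_rate
    if ratio > 5:
        return "critical"
    if ratio > 2:
        return "warning"
    return "ok"
```

The same shape (ratio against baseline, two cutoffs) covers the latency and confidence rows as well; only the metric and multipliers change.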
Version History
1.0.0 · 2026-03-01
- Initial release with four-layer monitoring framework
- Data quality monitor and drift detection code examples (PSI and KS tests)
- Alert threshold calibration table with warning and critical levels
- Proxy metric guidance for systems with delayed ground truth
- Monitoring readiness checklist with 10 items