Key Takeaway
The key difference between CI/CD for software and ML is that ML pipelines must treat data changes as first-class triggers alongside code changes, and evaluation gates must prevent both accuracy regressions and fairness degradations. This guide covers a six-stage pipeline architecture with pipeline-as-code examples, evaluation gate design, and canary deployment strategies.
Prerequisites
- Existing CI/CD infrastructure for application code (GitHub Actions, GitLab CI, Jenkins, etc.)
- ML training code in version control with reproducible training scripts
- An evaluation dataset with ground truth labels for automated benchmarking
- A model registry for storing and versioning model artifacts (MLflow, SageMaker, etc.)
- Container infrastructure for reproducible training environments (Docker, Kubernetes)
ML Pipelines vs. Software Pipelines
Software CI/CD pipelines are triggered by code changes, run tests, and deploy application artifacts. ML CI/CD pipelines must handle two additional dimensions: data changes (a new training dataset should trigger retraining just as a code change triggers a rebuild) and model evaluation (passing unit tests is not sufficient -- the model must demonstrate that it meets quality thresholds on representative evaluation data). These additions make ML pipelines longer, more computationally expensive, and more complex to debug than software pipelines.
The pipeline architecture described here is designed for teams that already have software CI/CD in place. It extends your existing pipeline with ML-specific stages rather than replacing it. The model training, evaluation, and deployment stages are implemented as additional workflow steps that run after traditional code quality checks pass.
The Six Pipeline Stages
The pipeline is organized into six stages. Each stage has defined inputs, outputs, and failure conditions. If any stage fails, the pipeline stops and the model does not progress toward production. This stage-gate approach ensures that only models that pass comprehensive validation reach production users.
```yaml
# GitHub Actions ML CI/CD pipeline
# Triggers on code changes AND data version changes
name: ML Model Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'models/**'
      - 'training/**'
      - 'data/versions/**'
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Model to retrain'
        required: true
      data_version:
        description: 'Data version to train on'
        required: true

jobs:
  # Stage 1: Source Validation
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint training code
        run: ruff check training/
      - name: Validate data schema
        run: python scripts/validate_schema.py --data-version ${{ inputs.data_version || 'latest' }}
      - name: Scan dependencies
        run: pip-audit -r requirements.txt

  # Stage 2: Training
  train:
    needs: validate
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        run: |
          python training/train.py \
            --data-version ${{ inputs.data_version || 'latest' }} \
            --experiment-name ${{ github.run_id }} \
            --output-dir artifacts/
      - name: Upload model artifact
        uses: actions/upload-artifact@v4
        with:
          name: model-artifact
          path: artifacts/

  # Stage 3: Evaluation
  evaluate:
    needs: train
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: model-artifact
          path: artifacts/
      - name: Run evaluation suite
        run: |
          python evaluation/evaluate.py \
            --model-path artifacts/model \
            --eval-dataset data/eval/ \
            --output evaluation-report.json
      - name: Check evaluation gates
        run: python evaluation/check_gates.py evaluation-report.json
      - name: Upload evaluation report
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-report
          path: evaluation-report.json

  # Stage 4: Register Model
  register:
    needs: evaluate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Each job starts in a fresh workspace, so re-download the
      # artifacts produced by earlier stages
      - uses: actions/download-artifact@v4
        with:
          name: model-artifact
          path: artifacts/
      - uses: actions/download-artifact@v4
        with:
          name: evaluation-report
      - name: Register in model registry
        run: |
          python scripts/register_model.py \
            --model-path artifacts/model \
            --eval-report evaluation-report.json \
            --version ${{ github.sha }}

  # Stage 5: Canary Deployment
  canary:
    needs: register
    runs-on: ubuntu-latest
    environment: production-canary
    steps:
      - uses: actions/checkout@v4
      - name: Deploy canary (5% traffic)
        run: |
          python scripts/deploy_canary.py \
            --model-version ${{ github.sha }} \
            --traffic-percentage 5
      - name: Monitor canary (30 min)
        run: python scripts/monitor_canary.py --duration 1800
      - name: Evaluate canary metrics
        run: python scripts/check_canary.py

  # Stage 6: Full Rollout
  rollout:
    needs: canary
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Progressive rollout
        run: |
          python scripts/deploy_canary.py \
            --model-version ${{ github.sha }} \
            --traffic-percentage 100
      - name: Create deployment record
        run: python scripts/record_deployment.py
```

Evaluation Gates
Evaluation gates are the most critical component of the ML pipeline. They prevent model regressions from reaching production by comparing the candidate model's performance against the current production model. A well-designed evaluation gate checks three dimensions: overall quality (is the new model at least as good as the current one?), slice-based quality (does the new model maintain quality across all important subgroups?), and fairness (does the new model maintain or improve fairness metrics?).
```python
"""Evaluation gate checker for ML CI/CD pipeline.

Compares a candidate model's evaluation report against
the production baseline and determines whether the
candidate should proceed to deployment.
"""
import json
import sys
from typing import List


def check_evaluation_gates(
    report_path: str,
    baseline_path: str = "baselines/production.json",
    regression_threshold: float = 0.02,
) -> bool:
    """Check whether a model passes evaluation gates.

    Gates:
        1. Overall accuracy >= baseline - regression_threshold
        2. No slice accuracy drops > 2x regression_threshold
        3. Fairness metrics within acceptable bounds
        4. Latency within SLA

    Returns True if all gates pass.
    """
    with open(report_path) as f:
        report = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)

    failures: List[str] = []

    # Gate 1: Overall accuracy
    candidate_acc = report["overall_accuracy"]
    baseline_acc = baseline["overall_accuracy"]
    if candidate_acc < baseline_acc - regression_threshold:
        failures.append(
            f"Overall accuracy regression: "
            f"{candidate_acc:.4f} < {baseline_acc:.4f} - "
            f"{regression_threshold}"
        )

    # Gate 2: Slice-based accuracy
    for slice_name, slice_acc in report.get("slices", {}).items():
        baseline_slice = baseline.get("slices", {}).get(
            slice_name, baseline_acc
        )
        max_drop = regression_threshold * 2
        if slice_acc < baseline_slice - max_drop:
            failures.append(
                f"Slice '{slice_name}' regression: "
                f"{slice_acc:.4f} < {baseline_slice:.4f} - "
                f"{max_drop}"
            )

    # Gate 3: Fairness metrics
    candidate_dpd = abs(report.get("demographic_parity_diff", 0))
    if candidate_dpd > 0.2:
        failures.append(
            f"Fairness violation: demographic parity "
            f"difference {candidate_dpd:.4f} > 0.2"
        )

    # Gate 4: Latency
    candidate_p99 = report.get("latency_p99_ms", 0)
    sla_p99 = baseline.get("latency_sla_p99_ms", 5000)
    if candidate_p99 > sla_p99:
        failures.append(
            f"Latency SLA violation: p99 {candidate_p99}ms "
            f"> SLA {sla_p99}ms"
        )

    if failures:
        print("EVALUATION GATES FAILED:")
        for failure in failures:
            print(f"  FAIL: {failure}")
        return False
    print("ALL EVALUATION GATES PASSED")
    return True


if __name__ == "__main__":
    report = sys.argv[1] if len(sys.argv) > 1 else "eval.json"
    passed = check_evaluation_gates(report)
    sys.exit(0 if passed else 1)
```

Canary Deployment Strategy
Canary deployment routes a small percentage of production traffic to the new model while the majority continues to be served by the current production model. The canary period provides real-world validation that offline evaluation cannot: does the model perform well on actual production traffic patterns? Does it handle edge cases that the evaluation dataset does not cover? Does it maintain quality under production load conditions?
A typical canary progression starts at 5% traffic for 30 minutes (detecting major regressions), expands to 25% for 4 hours (validating across traffic diversity), then to 50% for 24 hours (confirming stability), and finally to 100% (full rollout). Automated rollback triggers revert to the previous model if any quality metric drops below the threshold during any stage.
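The progression above can be expressed as a small driver loop. This is a hypothetical sketch, not the `deploy_canary.py` / `monitor_canary.py` scripts from the pipeline: the `deploy`, `metrics_ok`, and `rollback` callables stand in for whatever your serving platform provides.

```python
# Hypothetical sketch of a progressive canary rollout driver.
# deploy / metrics_ok / rollback are placeholders for your
# serving platform's traffic-shifting and monitoring APIs.
import time

# (traffic %, soak time in seconds) for each canary stage
CANARY_STAGES = [
    (5, 30 * 60),        # 5% for 30 min: catch major regressions
    (25, 4 * 60 * 60),   # 25% for 4 h: validate across traffic diversity
    (50, 24 * 60 * 60),  # 50% for 24 h: confirm stability
    (100, 0),            # full rollout
]


def run_canary(deploy, metrics_ok, rollback, stages=CANARY_STAGES) -> bool:
    """Walk the canary stages; roll back on the first failed check.

    deploy(pct)  routes pct% of traffic to the candidate model,
    metrics_ok() returns True while quality metrics stay within
    thresholds, and rollback() reverts all traffic to the
    previous production model.
    """
    for pct, soak_seconds in stages:
        deploy(pct)
        time.sleep(soak_seconds)  # let the stage soak before judging it
        if not metrics_ok():
            rollback()
            return False
    return True
```

Keeping the stage table as data rather than hard-coding each step makes it easy to tune soak times per model without touching the rollout logic.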
The most common CI/CD pipeline failure for ML is treating the evaluation dataset as static. If your evaluation dataset is never updated, it gradually becomes unrepresentative of production traffic, and models that pass evaluation gates start failing in production. Schedule quarterly evaluation dataset refreshes using sampled and labeled production data.
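One lightweight way to implement that refresh is to sample recent production requests for labeling. A minimal sketch, assuming request logs are available as dicts; the `timestamp` and `request_id` field names and the `sample_for_refresh` helper are illustrative, not part of the pipeline above:

```python
# Hypothetical sketch: sample recent production traffic for a
# quarterly evaluation-set refresh. Field names are illustrative;
# adapt them to your logging schema.
import random
from datetime import datetime, timedelta


def sample_for_refresh(requests, sample_size=500, window_days=90, seed=42):
    """Uniformly sample recent requests to send for human labeling.

    Restricting the pool to the last quarter keeps the refreshed
    evaluation set aligned with current traffic patterns.
    """
    cutoff = datetime.now() - timedelta(days=window_days)
    recent = [r for r in requests if r["timestamp"] >= cutoff]
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    k = min(sample_size, len(recent))
    return rng.sample(recent, k)
```

Uniform sampling is the simplest defensible choice; teams with known traffic skew may prefer stratified sampling over the same window.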
Pipeline Infrastructure
Evaluation and Deployment
Version History
1.0.0 · 2026-03-01
- Initial release with six-stage ML CI/CD pipeline architecture
- Complete GitHub Actions pipeline configuration
- Evaluation gate checker with four-dimensional quality validation
- Canary deployment strategy with progressive traffic expansion
- Pipeline infrastructure and evaluation readiness checklists