Key Takeaway
The key difference between CI/CD for software and ML is that ML pipelines must treat data changes as first-class triggers alongside code changes, and evaluation gates must prevent both accuracy regressions and fairness degradations. This guide covers a six-stage pipeline architecture with pipeline-as-code examples, evaluation gate design, and canary deployment strategies.
Prerequisites
- Existing CI/CD infrastructure for application code (GitHub Actions, GitLab CI, Jenkins, etc.)
- ML training code in version control with reproducible training scripts
- An evaluation dataset with ground truth labels for automated benchmarking
- A model registry for storing and versioning model artifacts (MLflow, SageMaker, etc.)
- Container infrastructure for reproducible training environments (Docker, Kubernetes)
ML Pipelines vs. Software Pipelines
Software CI/CD pipelines are triggered by code changes, run tests, and deploy application artifacts. ML CI/CD pipelines must handle two additional dimensions: data changes (a new training dataset should trigger retraining just as a code change triggers a rebuild) and model evaluation (passing unit tests is not sufficient -- the model must demonstrate that it meets quality thresholds on representative evaluation data). These additions make ML pipelines longer, more computationally expensive, and more complex to debug than software pipelines.
The pipeline architecture described here is designed for teams that already have software CI/CD in place. It extends your existing pipeline with ML-specific stages rather than replacing it. The model training, evaluation, and deployment stages are implemented as additional workflow steps that run after traditional code quality checks pass.
The Six Pipeline Stages
The pipeline is organized into six stages. Each stage has defined inputs, outputs, and failure conditions. If any stage fails, the pipeline stops and the model does not progress toward production. This stage-gate approach ensures that only models that pass comprehensive validation reach production users.
```yaml
# GitHub Actions ML CI/CD pipeline
# Triggers on code changes AND data version changes
name: ML Model Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'models/**'
      - 'training/**'
      - 'data/versions/**'
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Model to retrain'
        required: true
      data_version:
        description: 'Data version to train on'
        required: true

jobs:
  # Stage 1: Source Validation
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint training code
        run: ruff check training/
      - name: Validate data schema
        run: python scripts/validate_schema.py --data-version ${{ inputs.data_version || 'latest' }}
      - name: Scan dependencies
        run: pip-audit -r requirements.txt

  # Stage 2: Training
  train:
    needs: validate
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        run: |
          python training/train.py \
            --data-version ${{ inputs.data_version || 'latest' }} \
            --experiment-name ${{ github.run_id }} \
            --output-dir artifacts/
      - name: Upload model artifact
        uses: actions/upload-artifact@v4
        with:
          name: model-artifact
          path: artifacts/

  # Stage 3: Evaluation
  evaluate:
    needs: train
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: model-artifact
          path: artifacts/
      - name: Run evaluation suite
        run: |
          python evaluation/evaluate.py \
            --model-path artifacts/model \
            --eval-dataset data/eval/ \
            --output evaluation-report.json
      - name: Check evaluation gates
        run: python evaluation/check_gates.py evaluation-report.json
      - name: Upload evaluation report
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-report
          path: evaluation-report.json

  # Stage 4: Register Model
  register:
    needs: evaluate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Each job starts in a fresh workspace, so re-download the
      # artifacts produced by earlier stages
      - uses: actions/download-artifact@v4
        with:
          name: model-artifact
          path: artifacts/
      - uses: actions/download-artifact@v4
        with:
          name: evaluation-report
      - name: Register in model registry
        run: |
          python scripts/register_model.py \
            --model-path artifacts/model \
            --eval-report evaluation-report.json \
            --version ${{ github.sha }}

  # Stage 5: Canary Deployment
  canary:
    needs: register
    runs-on: ubuntu-latest
    environment: production-canary
    steps:
      - uses: actions/checkout@v4
      - name: Deploy canary (5% traffic)
        run: |
          python scripts/deploy_canary.py \
            --model-version ${{ github.sha }} \
            --traffic-percentage 5
      - name: Monitor canary (30 min)
        run: python scripts/monitor_canary.py --duration 1800
      - name: Evaluate canary metrics
        run: python scripts/check_canary.py

  # Stage 6: Full Rollout
  rollout:
    needs: canary
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Progressive rollout
        run: |
          python scripts/deploy_canary.py \
            --model-version ${{ github.sha }} \
            --traffic-percentage 100
      - name: Create deployment record
        run: python scripts/record_deployment.py
```

Evaluation Gates
Evaluation gates are the most critical component of the ML pipeline. They prevent model regressions from reaching production by comparing the candidate model's performance against the current production model. A well-designed evaluation gate checks three dimensions: overall quality (is the new model at least as good as the current one?), slice-based quality (does the new model maintain quality across all important subgroups?), and fairness (does the new model maintain or improve fairness metrics?).
```python
"""Evaluation gate checker for ML CI/CD pipeline.

Compares a candidate model's evaluation report against
the production baseline and determines whether the
candidate should proceed to deployment.
"""
import json
import sys
from typing import List


def check_evaluation_gates(
    report_path: str,
    baseline_path: str = "baselines/production.json",
    regression_threshold: float = 0.02,
) -> bool:
    """Check whether a model passes evaluation gates.

    Gates:
        1. Overall accuracy >= baseline - regression_threshold
        2. No slice accuracy drops > 2x regression_threshold
        3. Fairness metrics within acceptable bounds
        4. Latency within SLA

    Returns True if all gates pass.
    """
    with open(report_path) as f:
        report = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)

    failures: List[str] = []

    # Gate 1: Overall accuracy
    candidate_acc = report["overall_accuracy"]
    baseline_acc = baseline["overall_accuracy"]
    if candidate_acc < baseline_acc - regression_threshold:
        failures.append(
            f"Overall accuracy regression: "
            f"{candidate_acc:.4f} < {baseline_acc:.4f} - "
            f"{regression_threshold}"
        )

    # Gate 2: Slice-based accuracy
    for slice_name, slice_acc in report.get("slices", {}).items():
        baseline_slice = baseline.get("slices", {}).get(
            slice_name, baseline_acc
        )
        max_drop = regression_threshold * 2
        if slice_acc < baseline_slice - max_drop:
            failures.append(
                f"Slice '{slice_name}' regression: "
                f"{slice_acc:.4f} < {baseline_slice:.4f} - "
                f"{max_drop}"
            )

    # Gate 3: Fairness metrics
    candidate_dpd = abs(report.get("demographic_parity_diff", 0))
    if candidate_dpd > 0.2:
        failures.append(
            f"Fairness violation: demographic parity "
            f"difference {candidate_dpd:.4f} > 0.2"
        )

    # Gate 4: Latency
    candidate_p99 = report.get("latency_p99_ms", 0)
    sla_p99 = baseline.get("latency_sla_p99_ms", 5000)
    if candidate_p99 > sla_p99:
        failures.append(
            f"Latency SLA violation: p99 {candidate_p99}ms "
            f"> SLA {sla_p99}ms"
        )

    if failures:
        print("EVALUATION GATES FAILED:")
        for failure in failures:
            print(f"  FAIL: {failure}")
        return False
    print("ALL EVALUATION GATES PASSED")
    return True


if __name__ == "__main__":
    report = sys.argv[1] if len(sys.argv) > 1 else "eval.json"
    passed = check_evaluation_gates(report)
    sys.exit(0 if passed else 1)
```

Canary Deployment Strategy
Canary deployment routes a small percentage of production traffic to the new model while the majority continues to be served by the current production model. The canary period provides real-world validation that offline evaluation cannot: does the model perform well on actual production traffic patterns? Does it handle edge cases that the evaluation dataset does not cover? Does it maintain quality under production load conditions?
A typical canary progression starts at 5% traffic for 30 minutes (detecting major regressions), expands to 25% for 4 hours (validating across traffic diversity), then to 50% for 24 hours (confirming stability), and finally to 100% (full rollout). Automated rollback triggers revert to the previous model if any quality metric drops below the threshold during any stage.
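The progression above can be expressed as a small driver loop. This is a hypothetical sketch, not the `deploy_canary.py` / `monitor_canary.py` scripts from the pipeline: the `deploy`, `metrics_ok`, and `rollback` callables stand in for whatever your serving platform provides.

```python
# Hypothetical sketch of a progressive canary rollout driver.
# deploy / metrics_ok / rollback are placeholders for your
# serving platform's traffic-shifting and monitoring APIs.
import time

# (traffic %, soak time in seconds) for each canary stage
CANARY_STAGES = [
    (5, 30 * 60),        # 5% for 30 min: catch major regressions
    (25, 4 * 60 * 60),   # 25% for 4 h: validate across traffic diversity
    (50, 24 * 60 * 60),  # 50% for 24 h: confirm stability
    (100, 0),            # full rollout
]


def run_canary(deploy, metrics_ok, rollback, stages=CANARY_STAGES) -> bool:
    """Walk the canary stages; roll back on the first failed check.

    deploy(pct)  routes pct% of traffic to the candidate model,
    metrics_ok() returns True while quality metrics stay within
    thresholds, and rollback() reverts all traffic to the
    previous production model.
    """
    for pct, soak_seconds in stages:
        deploy(pct)
        time.sleep(soak_seconds)  # let the stage soak before judging it
        if not metrics_ok():
            rollback()
            return False
    return True
```

Keeping the stage table as data rather than hard-coding each step makes it easy to tune soak times per model without touching the rollout logic.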
The most common CI/CD pipeline failure for ML is treating the evaluation dataset as static. If your evaluation dataset is never updated, it gradually becomes unrepresentative of production traffic, and models that pass evaluation gates start failing in production. Schedule quarterly evaluation dataset refreshes using sampled and labeled production data.
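One lightweight way to implement that refresh is to sample recent production requests for labeling. A minimal sketch, assuming request logs are available as dicts; the `timestamp` and `request_id` field names and the `sample_for_refresh` helper are illustrative, not part of the pipeline above:

```python
# Hypothetical sketch: sample recent production traffic for a
# quarterly evaluation-set refresh. Field names are illustrative;
# adapt them to your logging schema.
import random
from datetime import datetime, timedelta


def sample_for_refresh(requests, sample_size=500, window_days=90, seed=42):
    """Uniformly sample recent requests to send for human labeling.

    Restricting the pool to the last quarter keeps the refreshed
    evaluation set aligned with current traffic patterns.
    """
    cutoff = datetime.now() - timedelta(days=window_days)
    recent = [r for r in requests if r["timestamp"] >= cutoff]
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    k = min(sample_size, len(recent))
    return rng.sample(recent, k)
```

Uniform sampling is the simplest defensible choice; teams with known traffic skew may prefer stratified sampling over the same window.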
Pipeline Infrastructure
Evaluation and Deployment
Version History
1.0.0 · 2026-03-01
- Initial release with six-stage ML CI/CD pipeline architecture
- Complete GitHub Actions pipeline configuration
- Evaluation gate checker with four-dimensional quality validation
- Canary deployment strategy with progressive traffic expansion
- Pipeline infrastructure and evaluation readiness checklists