Back to Projects
Platform Engineering

CI/CD Pipeline Optimization - Success Rate Recovery

"From 47% to 73% success rate in 11 days"

+26%
Success Rate
0
Runs Analyzed
70%
Faster Builds
0
Workers
0
Environments
0
Days

░▒▓ The Problem

Inherited a CI/CD pipeline with coin-flip reliability. Deployments were blocked, developer time was lost, and trust in the pipeline was eroding.

Starting State (Week 1)
PIPELINE STATUS: UNRELIABLE
═══════════════════════════════════════════════════════════════

Starting State:
├── Success Rate: ~47%
├── Failures: Deploy timeouts, billing limits, flaky tests
├── Impact: Blocked deployments, lost developer time
└── Status: UNRELIABLE

Every other deployment failed. Not acceptable.

░▒▓ Root Cause Analysis

Analyzed ~70 failed runs to identify failure patterns. Categorized each failure and measured frequency.

Failure Pattern Analysis
FAILURE BREAKDOWN (~70 failed runs analyzed)
═══════════════════════════════════════════════════════════════

┌──────────────────────────┬────────────┬──────────────────────────────┐
│ Failure Type             │ % of Total │ Root Cause                   │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ Deploy/Infrastructure    │    ~40%    │ Port conflicts, timeouts,    │
│                          │            │ runner resource limits       │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ Unit Tests               │    ~25%    │ Mocked dependencies,         │
│                          │            │ flaky assertions             │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ Type Check               │    ~20%    │ Code pushed without          │
│                          │            │ running local checks         │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ Build/Migration          │    ~10%    │ Long builds, migration       │
│                          │            │ conflicts                    │
├──────────────────────────┼────────────┼──────────────────────────────┤
│ Billing                  │     ~5%    │ GitHub Actions spending      │
│                          │            │ limits hit                   │
└──────────────────────────┴────────────┴──────────────────────────────┘

Top insight: 40% of failures were infrastructure, not code.

═══════════════════════════════════════════════════════════════

░▒▓ Solutions Implemented

[1]

Self-Hosted Runner

Migrated from GitHub-hosted to self-hosted runner. Eliminated billing failures completely and reduced queue wait times.

[2]

Docker Layer Caching

Added BuildKit cache mounts. Reduced build times from ~10 minutes to ~3 minutes. Eliminated timeout failures.

[3]

Separate Worker Dockerfile

Created Dockerfile.worker for background jobs. Workers no longer require full Next.js build. 14x faster deployments.

[4]

Pre-Built Artifacts

Build in CI, transfer artifacts to server. Eliminated server-side build failures and enabled verified builds.

[5]

Flaky Test Isolation

Temporarily disabled E2E tests blocking deploys. Stopped false failures from blocking releases.

░▒▓ Results

Tracked daily success rates over 11 days. Clear improvement trend with the lowest point triggering the most impactful fixes.

Daily Success Rate Trend
DAILY SUCCESS RATE TREND
═══════════════════════════════════════════════════════════════════

Feb 11: 52% ██████████░░░░░░░░░░  START
Feb 12: 50% ██████████░░░░░░░░░░  ▼
Feb 13: 38% ███████░░░░░░░░░░░░░  ▼▼ LOWEST (triggered action)
Feb 16: 53% ██████████░░░░░░░░░░  ▲▲ Recovery begins
Feb 17: 59% ███████████░░░░░░░░░  ▲
Feb 18: 59% ███████████░░░░░░░░░  ─
Feb 19: 83% ████████████████░░░░  ▲▲ BIG JUMP (self-hosted runner)
Feb 20: 69% █████████████░░░░░░░  ▼ (high volume day)
Feb 21: 88% █████████████████░░░  ▲▲ BEST DAY
Feb 22: 80% ████████████████░░░░  ─ Holding steady

═══════════════════════════════════════════════════════════════════
Improvement Summary
BEFORE vs AFTER
═══════════════════════════════════════════════════════════════

┌────────────────────────────┬──────────┬──────────┬───────────┐
│ Metric                     │ Before   │ After    │ Change    │
├────────────────────────────┼──────────┼──────────┼───────────┤
│ Weekly Success Rate        │ ~47%     │ ~66%     │ +19%      │
├────────────────────────────┼──────────┼──────────┼───────────┤
│ Best Day                   │ 52%      │ 88%      │ +36%      │
├────────────────────────────┼──────────┼──────────┼───────────┤
│ Oldest vs Newest 3 Days    │ 47%      │ 73%      │ +26%      │
├────────────────────────────┼──────────┼──────────┼───────────┤
│ Main App Build Time        │ ~10 min  │ ~3 min   │ -70%      │
├────────────────────────────┼──────────┼──────────┼───────────┤
│ Worker Build Time          │ ~7 min   │ ~30 sec  │ -93% (14x)│
├────────────────────────────┼──────────┼──────────┼───────────┤
│ Billing Failures           │ ~5%      │ 0%       │ Eliminated│
└────────────────────────────┴──────────┴──────────┴───────────┘

═══════════════════════════════════════════════════════════════

░▒▓ Pipeline Architecture

CI/CD Workflow Structure
GITHUB ACTIONS WORKFLOW
═══════════════════════════════════════════════════════════════

on:
  push:
    branches: [main, 'feature/**', 'fix/**']
  pull_request:
    branches: [main]

JOBS:
┌─────────────────┐
│  quality-gate   │  Type check, lint, unit tests
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│      e2e        │  Playwright browser tests (conditional)
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
┌────────┐ ┌────────┐
│deploy- │ │deploy- │
│  prod  │ │  dev   │
│(main)  │ │(feature│
└────────┘ │branches│
           └────────┘

SELF-HOSTED RUNNER:
├── No billing limits
├── No queue delays
├── Full control over environment
└── Docker layer cache persisted

═══════════════════════════════════════════════════════════════

░▒▓ Docker Architecture

Multi-Stage Build Strategy
DOCKERFILE (Main Application) - Multi-Stage
═══════════════════════════════════════════════════════════════

Stage 1: Dependencies
├── FROM node:22-alpine AS deps
├── Copy package*.json
└── RUN npm ci

Stage 2: Builder
├── FROM node:22-alpine AS builder
├── Copy from deps
├── Copy source
└── RUN npm run build

Stage 3: Runner (Production)
├── FROM node:22-alpine AS runner
├── ENV NODE_ENV=production
├── Copy .next/standalone
├── Copy .next/static
└── CMD ["node", "server.js"]

───────────────────────────────────────────────────────────────

DOCKERFILE.WORKER (Fast builds for workers)
═══════════════════════════════════════════════════════════════

FROM node:22-alpine
├── Copy package*.json
├── RUN npm ci --only=production
├── Copy src/
└── CMD ["npx", "tsx", "src/workers/gap-analysis.ts"]

Result: Workers no longer require full Next.js build
        Build time: 7+ minutes → ~30 seconds (14x faster)

═══════════════════════════════════════════════════════════════

░▒▓ Service Management

systemd Services
PRODUCTION SERVICES
═══════════════════════════════════════════════════════════════

systemd Services:
├── app.service              # Main Next.js application
├── worker-documents.service # Document processing
├── worker-gap.service       # Gap analysis
├── worker-competency.service# Competency gap detection
├── worker-rfq.service       # RFQ processor
├── worker-email.service     # Email notifications
└── worker-compliance.service# Compliance alerts

SERVICE CONFIGURATION:
┌────────────────────────────────────────────────────┐
│ [Unit]                                             │
│ Description=Enterprise SaaS Application           │
│ After=network.target                              │
│                                                    │
│ [Service]                                          │
│ Type=simple                                        │
│ User=ubuntu                                        │
│ WorkingDirectory=/home/ubuntu/app                 │
│ ExecStart=/usr/bin/node server.js                 │
│ Restart=always                                     │
│ RestartSec=10                                      │
│ Environment=NODE_ENV=production                   │
│                                                    │
│ [Install]                                          │
│ WantedBy=multi-user.target                        │
└────────────────────────────────────────────────────┘

═══════════════════════════════════════════════════════════════

░▒▓ Tech Stack

CI/CD

GitHub Actions
Workflow automation
Self-hosted Runner
Custom build environment

Containers

Docker
Multi-stage builds
BuildKit
Layer caching

Process Mgmt

systemd
Service management
Bash
Deployment scripts

Monitoring

Health Checks
Endpoint verification
curl
HTTP testing

░▒▓ Key Achievements

01

+26% CI/CD Success Rate in 11 Days

Improved pipeline reliability from ~47% to ~73% through systematic debugging and infrastructure improvements.

02

Eliminated Billing Failures

Migrated to self-hosted runner, removing all billing-related failures and queue delays.

03

14x Faster Worker Builds

Reduced worker build times from 7+ minutes to ~30 seconds with dedicated Dockerfile.worker.

04

70% Faster Main Builds

Implemented Docker layer caching reducing main build times from ~10 minutes to ~3 minutes.

05

7 systemd Services Deployed

Production services with health checks, auto-restart, and multi-environment support.

░▒▓ What I Learned

>>

Analyze Before Fixing

Categorizing ~70 failures revealed that 40% were infrastructure, not code. Without analysis, would have focused on wrong problems.

::

Docker Layer Caching Matters

BuildKit cache mounts transformed build times. Small config change, massive impact.

/\

Self-Hosted Runners Are Worth It

Full control, no billing limits, persistent caches. Setup cost paid off immediately.

$$

Isolate Flaky Tests

Flaky tests blocking deploys hurt more than they help. Disable, fix separately, re-enable.