When a SaaS startup came to us complaining about 45-minute CI/CD pipeline runs blocking deployments, we knew we had a classic optimization problem. The team was running a Node.js application with 2,000+ tests, heavy Docker builds, and a monolithic pipeline that ran everything on every commit. Developers were frustrated, deployments were delayed, and velocity was suffering. This is a common challenge we see in our infrastructure audits.
After a comprehensive audit and systematic optimization, we reduced their pipeline time from 45 minutes to just 6 minutes - an 87% improvement. More importantly, we eliminated flaky tests, reduced CI costs by 60%, and enabled the team to ship code 7x faster. Similar improvements are detailed in our scaling case study and CI/CD security guide.
1. Identifying the Bottlenecks
Before optimizing anything, we needed to understand where time was actually being spent. The team assumed tests were the problem, but our profiling revealed a different story.
Time-Boxed Profiling of Each Pipeline Stage
We instrumented their GitHub Actions workflow to measure execution time for each stage:
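GitHub Actions shows per-step durations in its UI, but to aggregate numbers across many runs we also wanted them in the logs. A minimal sketch of this kind of instrumentation (step names are illustrative, not the team's exact workflow):
# Sketch: bracket a stage with timestamps so its duration appears in the logs
- name: Mark build start
  run: echo "STAGE_START=$(date +%s)" >> "$GITHUB_ENV"
- name: Build Docker image
  run: docker build -t app:ci .
- name: Report build duration
  if: always()
  run: echo "Docker build took $(( $(date +%s) - STAGE_START ))s"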

# Before optimization - Stage breakdown
Install Dependencies: 8 minutes (18%)
Build Docker Image: 12 minutes (27%)
Run Unit Tests: 15 minutes (33%)
Run Integration Tests: 8 minutes (18%)
Deploy to Staging: 2 minutes (4%)
Total: 45 minutes
The results were surprising: Docker builds and test execution were the main culprits, not dependency installation as they suspected. For teams using GitHub Actions or other CI/CD platforms, proper job orchestration is critical for performance.
Measuring Queue Times vs. Execution Times
We discovered that runners were spending significant time waiting in queues. On average:
- Queue time: 3-5 minutes per job (shared runners)
- Actual execution: 40 minutes
- Total wall-clock time: 45 minutes
This queue time was invisible to developers but added up to 10-15% overhead. Moving to self-hosted runners eliminated this entirely.
Detecting Flaky Tests, Redundant Jobs, Slow Containers, and Heavy Dependencies
Our analysis uncovered several hidden issues:
- Flaky tests: 12 tests failing randomly 15-20% of the time, causing re-runs
- Redundant jobs: Running linting, type-checking, and tests separately when they could be parallelized
- Slow containers: Using `node:16` base image (800MB) instead of `node:16-alpine` (120MB)
- Heavy dependencies: Installing all dev dependencies including unused packages (2,400 packages total)
2. Reducing Build & Test Times
Implementing Incremental Builds / Remote Caching
For Node.js applications, we implemented several caching strategies:
npm Cache Restoration
# GitHub Actions example
- name: Cache node modules
  uses: actions/cache@v3
  with:
    path: ~/.npm
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-node-
This reduced dependency installation from 8 minutes to 30 seconds on cache hits (95% of builds).
Docker Layer Caching
We implemented Docker BuildKit cache mounts and layer caching:
# Dockerfile optimization
FROM node:16-alpine AS dependencies
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production && \
    npm cache clean --force

FROM node:16-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:16-alpine
WORKDIR /app
COPY --from=dependencies /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
CMD ["node", "dist/index.js"]
Layer caching reduced Docker build time from 12 minutes to 2-3 minutes when only application code changed.
Parallelizing Test Execution
Jest supports parallel execution out of the box, but the team wasn't leveraging it effectively. We configured:
// jest.config.js
module.exports = {
  maxWorkers: '50%', // Use half of available CPUs
  testTimeout: 10000,
  // Split tests by type
  projects: [
    {
      displayName: 'unit',
      testMatch: ['<rootDir>/src/**/*.test.js'],
    },
    {
      displayName: 'integration',
      testMatch: ['<rootDir>/tests/**/*.test.js'],
    },
  ],
};
We also split the test suite across multiple jobs using Jest's `--shard` flag together with `--testPathPattern`:
# Run tests in parallel across 4 jobs
- name: Run unit tests (shard 1/4)
  run: npm test -- --testPathPattern="src/.*" --shard=1/4
- name: Run unit tests (shard 2/4)
  run: npm test -- --testPathPattern="src/.*" --shard=2/4
This reduced test execution from 15 minutes to 4 minutes by running 4 test jobs in parallel.
Replacing Slow Test Frameworks or Optimizing Test Setup/Teardown
The team was using a heavy E2E testing framework (Cypress) for integration tests that required spinning up a full browser. We:
- Replaced browser-based tests with API-level integration tests using Supertest (10x faster)
- Moved critical E2E tests to a separate nightly job
- Optimized database setup/teardown by using transactions instead of full migrations
Using Container Pre-warming or Optimized Base Images
We switched from `node:16` (800MB) to `node:16-alpine` (120MB), reducing:
- Image pull time: 45 seconds → 8 seconds
- Build context size: 1.2GB → 180MB
- Overall build time: 12 minutes → 8 minutes
For self-hosted runners, we pre-warmed containers with base images, eliminating pull time entirely.
3. Dependency Optimization
Caching Package Managers Effectively (npm)
We implemented a multi-layer caching strategy:
- npm cache: Cached `~/.npm` directory based on `package-lock.json` hash
- node_modules cache: Cached `node_modules` directory for faster restores
- Docker layer cache: Cached npm install layer in Docker builds
# Multi-layer npm caching
- name: Get npm cache directory
  id: npm-cache
  run: echo "dir=$(npm config get cache)" >> $GITHUB_OUTPUT
- name: Cache npm
  uses: actions/cache@v3
  with:
    path: ${{ steps.npm-cache.outputs.dir }}
    key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-npm-
Eliminating Unused Dependencies
We audited the `package.json` and found:
- 127 unused dependencies (installed but never imported)
- 45 duplicate packages (different versions of the same library)
- 23 deprecated packages with security vulnerabilities
Using `depcheck` and `npm-check`, we removed 195 unnecessary packages, reducing:
- Installation time: 8 minutes → 5 minutes
- Docker image size: 1.2GB → 850MB
- Security surface: 23 fewer vulnerable packages
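To keep unused packages from creeping back in after a cleanup like this, a lightweight guard step can run in CI. A sketch (the ignore list is illustrative, for tooling that is required but never imported):
# Sketch: fail fast if unused dependencies reappear
- name: Check for unused dependencies
  run: npx depcheck --ignores="typescript,jest"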
Pinning Versions to Avoid Repeated Resolution Delays
The team was using version ranges (`^1.2.3`) which caused npm to resolve versions on every install. We:
- Locked all versions in `package.json` to exact versions
- Used `package-lock.json` consistently (already in place)
- Enabled `npm ci` instead of `npm install` for deterministic installs
This eliminated version resolution time (30-60 seconds) and ensured consistent builds across environments.
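In the workflow this is a one-line change; the extra flags below are optional speed-ups we are sketching here, not necessarily what the team ran:
# Deterministic install from the lockfile; skip audit/funding output to save a few seconds
- name: Install dependencies
  run: npm ci --prefer-offline --no-audit --no-fund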
Introducing Artifact Repositories for Reusability
We set up a private npm registry (using GitHub Packages) to:
- Cache internal packages locally
- Share built artifacts across pipelines
- Reduce external npm registry calls
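Pointing CI at GitHub Packages is mostly an authentication concern, which `actions/setup-node` handles; a sketch (the `@your-org` scope is a placeholder):
# Sketch: resolve @your-org packages from GitHub Packages instead of the public registry
- uses: actions/setup-node@v3
  with:
    node-version: '18'
    registry-url: 'https://npm.pkg.github.com'
    scope: '@your-org'
- run: npm ci
  env:
    NODE_AUTH_TOKEN: ${{ secrets.GITHUB_TOKEN }}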
For Docker images, we used GitHub Container Registry with layer caching, reducing image push/pull times by 40%.
4. Eliminating Flaky & Redundant Tests
Logging and Tracking Failure Patterns
We implemented test result tracking to identify flaky tests:
// test-reporter.js - custom Jest reporter that logs every failure for later analysis
const fs = require('fs');

class FailureLogReporter {
  onTestResult(test, testResult) {
    for (const result of testResult.testResults) {
      if (result.status === 'failed') {
        const testData = {
          name: result.fullName,
          file: testResult.testFilePath,
          timestamp: new Date().toISOString(),
          error: result.failureMessages.join('\n'),
        };
        // Log to file for analysis (one JSON object per line)
        fs.appendFileSync('test-failures.jsonl', JSON.stringify(testData) + '\n');
      }
    }
  }
}

module.exports = FailureLogReporter;
// Registered in jest.config.js via: reporters: ['default', '<rootDir>/test-reporter.js']
Over 2 weeks, we identified 12 consistently flaky tests. Common causes:
- Race conditions in async tests (5 tests)
- Time-dependent assertions without mocking (3 tests)
- Shared test state between tests (2 tests)
- External API dependencies (2 tests)
Separate Critical Tests from Long-Running Optional Ones
We reorganized the test suite into three tiers:
- Fast unit tests: Run on every commit (500 tests, 2 minutes)
- Integration tests: Run on PRs and main branch (800 tests, 5 minutes)
- E2E tests: Run nightly or on release tags (700 tests, 20 minutes)
This meant developers got feedback in 2 minutes instead of 23 minutes for most changes.
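The nightly/release tier is simply a separate workflow with its own triggers; a sketch (`ci-e2e.yml` and the `test:e2e` script are hypothetical names):
# ci-e2e.yml - heavy E2E tier, decoupled from every-commit feedback
name: E2E
on:
  schedule:
    - cron: '0 3 * * *'   # nightly run
  push:
    tags:
      - 'v*'              # and on release tags
jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm ci
      - run: npm run test:e2e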
Delete or Refactor Duplicate Test Coverage
We found significant test duplication:
- 45 tests covering the same functionality with different names
- 23 tests that were superseded by newer, better tests
- 12 tests testing implementation details instead of behavior
We removed 80 redundant tests, reducing test execution time by 3 minutes without losing coverage.
Run Heavy Integration Tests Only When Relevant Files Change
We implemented path-based test execution using `jest-changed-files`:
# Only run integration tests if API or database code changed
- name: Check changed files
  id: changed-files
  uses: tj-actions/changed-files@v35
  with:
    files: |
      src/api/**
      src/database/**
      tests/integration/**
- name: Run integration tests
  if: steps.changed-files.outputs.any_changed == 'true'
  run: npm run test:integration
This reduced integration test runs by 70%, only executing when API or database code actually changed.
5. Intelligent Job Triggering
Moving Away from "Run Everything on Every Commit"
The biggest win came from conditional job execution. We implemented path-based triggers:
# .github/workflows/ci.yml
name: CI
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      docs: ${{ steps.filter.outputs.docs }}
      frontend: ${{ steps.filter.outputs.frontend }}
      backend: ${{ steps.filter.outputs.backend }}
    steps:
      - uses: actions/checkout@v3
      - uses: dorny/paths-filter@v2
        id: filter
        with:
          filters: |
            docs:
              - 'docs/**'
              - '*.md'
            frontend:
              - 'frontend/**'
            backend:
              - 'src/**'
              - 'server/**'
  test-backend:
    needs: changes
    if: needs.changes.outputs.backend == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Run backend tests
        run: npm test
Adding Path-Based Triggers or Conditional Workflows
We created separate workflow files for different change types:
- ci-docs.yml: Only runs on documentation changes (linting, spell-check)
- ci-frontend.yml: Runs frontend tests and builds
- ci-backend.yml: Runs backend tests and API tests
- ci-full.yml: Runs everything (only on main branch merges)
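Each file keys its triggers to the relevant paths, so unrelated changes never start the workflow at all. A sketch of the docs-only variant (the `lint:docs` script is a placeholder):
# ci-docs.yml - only triggered by documentation changes
name: Docs CI
on:
  pull_request:
    paths:
      - 'docs/**'
      - '**/*.md'
jobs:
  lint-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm run lint:docs   # hypothetical docs lint/spell-check script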
Skipping Jobs for Docs-Only or Non-Critical Changes
For documentation-only changes, we skip all tests:
- name: Skip CI for docs
  if: >-
    contains(github.event.head_commit.message, '[skip ci]') ||
    steps.changed-files.outputs.docs == 'true'
  run: |
    echo "Skipping CI for documentation changes"
    exit 0
This reduced CI runs by 25% (many PRs were just README or comment updates).
Using Commit Message Tags like [skip ci]
We implemented commit message parsing to skip CI:
- [skip ci]: Skip all CI jobs
- [skip tests]: Skip test jobs, run only linting
- [ci fast]: Run only fast unit tests
This gave developers control over CI execution for non-critical changes.
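GitHub Actions honours `[skip ci]` in commit messages out of the box; custom tags like `[skip tests]` can be wired with an expression check at the job level. A sketch (the job layout is illustrative):
# Sketch: skip the test job when the commit message carries [skip tests]
test:
  if: ${{ !contains(github.event.head_commit.message, '[skip tests]') }}
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3
    - run: npm test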
6. Pipeline Architecture Redesign
Breaking a Monolithic Pipeline into Smaller, Modular Workflows
The original pipeline was a single 45-minute job. We split it into:
Before: Single Job (45 min)
Install → Build → Test → Deploy
After: Parallel Jobs (6 min total)
Job 1: Lint (1 min) | Job 2: Unit Tests (2 min) | Job 3: Build (3 min)
↓
Job 4: Integration Tests (4 min) | Job 5: Deploy (2 min)
Introducing Fan-Out/Fan-In Stages
We implemented a fan-out pattern for tests:
jobs:
  test-matrix:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    runs-on: ubuntu-latest
    steps:
      - name: Run test shard
        run: npm test -- --shard=${{ matrix.shard }}/4
  test-aggregate:
    needs: test-matrix
    runs-on: ubuntu-latest
    steps:
      - name: Aggregate test results
        run: npm run test:coverage:merge
This allowed 4 test jobs to run in parallel, reducing test time from 15 minutes to 4 minutes.
Shifting More Logic from CI to Local Pre-commit Hooks
We moved fast checks to pre-commit hooks using Husky:
#!/usr/bin/env sh
# .husky/pre-commit
. "$(dirname -- "$0")/_/husky.sh"
# Fast checks that run locally
npm run lint:staged
npm run type-check
npm run test:unit:changed
This caught 80% of issues before they reached CI, reducing CI failures and re-runs.
Adopting a Matrix Build Strategy
For testing across Node.js versions, we used matrix builds:
strategy:
  matrix:
    node-version: [16.x, 18.x, 20.x]
    os: [ubuntu-latest, windows-latest]
runs-on: ${{ matrix.os }}
steps:
  - uses: actions/setup-node@v3
    with:
      node-version: ${{ matrix.node-version }}
This ran tests in parallel across 6 combinations (3 Node versions × 2 OS), completing in the time of a single test run.
7. Introducing Caching & Artifacts
Layer Caching for Docker Builds
We implemented Docker BuildKit cache mounts:
# Build command with BuildKit layer caching (the type= cache flags require buildx)
docker buildx build \
  --cache-from type=registry,ref=ghcr.io/org/app:latest \
  --cache-from type=local,src=/tmp/.buildx-cache \
  --cache-to type=local,dest=/tmp/.buildx-cache \
  -t app:latest .
This cached Docker layers between builds, reducing build time from 12 minutes to 2-3 minutes on cache hits.
Sharing Build Artifacts Across Jobs Instead of Re-building
We used GitHub Actions artifacts to share build outputs:
# Build job
- name: Build application
  run: npm run build
- name: Upload build artifacts
  uses: actions/upload-artifact@v3
  with:
    name: dist
    path: dist/

# Test job (uses built artifacts)
- name: Download build artifacts
  uses: actions/download-artifact@v3
  with:
    name: dist
- name: Run tests against built artifacts
  run: npm test
This eliminated duplicate builds and ensured tests ran against the exact code that would be deployed.
Caching Test Databases or Pre-computed Assets
For integration tests requiring databases, we:
- Used Docker Compose to spin up test databases (PostgreSQL, Redis)
- Cached database initialization scripts
- Used test fixtures instead of seeding fresh data each time
This reduced database setup time from 2 minutes to 10 seconds.
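A minimal sketch of the Compose file for the test databases (the filename, image tags, and the in-memory tmpfs trick are our illustration, not necessarily the team's exact setup):
# docker-compose.test.yml - throwaway databases for integration tests
version: "3.8"
services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_PASSWORD: test
      POSTGRES_DB: app_test
    ports:
      - "5432:5432"
    tmpfs:
      - /var/lib/postgresql/data   # keep data in memory so each run starts clean and fast
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"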
Using Persistent Runners for Warm Caches
We migrated from GitHub-hosted runners to self-hosted runners with:
- Persistent Docker layer cache
- Pre-installed Node.js and common dependencies
- Warm npm cache
- No queue time
This provided consistent performance and eliminated the 3-5 minute queue times.
8. Hardware & Runner Improvements
Moving to More Powerful or Dedicated Runners
We replaced GitHub's standard runners (2 vCPUs, 7GB RAM) with self-hosted runners:
- CPU: 8 vCPUs (4x improvement)
- RAM: 16GB (2.3x improvement)
- Storage: NVMe SSD (10x faster I/O)
This reduced test execution time by 40% due to faster CPU and I/O.
Switching from Shared SaaS Runners to Self-Hosted
Benefits of self-hosted runners:
- No queue time: Immediate job execution
- Persistent caches: Docker layers, npm cache survive between runs
- Custom configuration: Pre-installed tools, optimized for our stack
- Cost savings: $0.008/minute vs. $0.08/minute for GitHub-hosted
We set up runners using GitHub Actions Runner on AWS EC2 instances with auto-scaling.
Leveraging Autoscaling Runners for Parallel Workloads
We implemented autoscaling using actions-runner-controller on Kubernetes:
# runner-deployment.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: github-runner
spec:
  template:
    spec:
      repository: org/repo
---
# Autoscaling lives in a separate HorizontalRunnerAutoscaler resource
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: github-runner-autoscaler
spec:
  scaleTargetRef:
    name: github-runner
  minReplicas: 1
  maxReplicas: 10
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      amount: 1
      duration: "1m"
This automatically scaled runners based on queue depth, handling 10 parallel jobs during peak times.
9. Measuring Improvements & Feedback Loop
Setting Up Dashboards for Build Time, Test Flakiness, Queue Depth
We created a CI/CD metrics dashboard using:
- GitHub Actions API: Track build times, success rates
- Prometheus: Export CI metrics
- Grafana: Visualize trends and alerts
Key metrics tracked:
- Average build time (target: < 10 minutes)
- Test flakiness rate (target: < 1%)
- Queue depth and wait times
- Cache hit rates (target: > 80%)
- Cost per build
Tracking Performance Before/After Changes
We maintained a performance log:

Performance Improvement Timeline
- Baseline: 45 minutes (100%)
- After caching: 32 minutes (71%) - 29% improvement
- After parallelization: 18 minutes (40%) - 60% improvement
- After path-based triggers: 12 minutes (27%) - 73% improvement
- After self-hosted runners: 6 minutes (13%) - 87% improvement
Adding Alerts When Pipelines Degrade Again
We set up alerts for:
- Build time exceeding 10 minutes (p95)
- Test flakiness rate above 2%
- Cache hit rate below 70%
- Queue wait time above 2 minutes
These alerts helped catch regressions immediately, preventing the pipeline from slowly degrading over time.
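As a concrete example, the build-time alert can be expressed as a standard Prometheus rule; a sketch, assuming a `ci_build_duration_seconds` histogram is exported (the metric name and threshold are illustrative):
# prometheus-alerts.yml (sketch)
groups:
  - name: ci-pipeline
    rules:
      - alert: CIPipelineSlow
        expr: histogram_quantile(0.95, sum(rate(ci_build_duration_seconds_bucket[6h])) by (le)) > 600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "p95 CI build time above 10 minutes"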
Regular Reviews to Remove Accumulated "CI Debt"
We instituted monthly CI/CD reviews to:
- Remove unused jobs or workflows
- Update dependencies and base images
- Review and optimize slow tests
- Clean up old artifacts and caches
- Review CI costs and identify optimization opportunities
This prevented the accumulation of "CI debt" that slows pipelines over time.
10. Cultural & Workflow Improvements
Encouraging Smaller Pull Requests
Large PRs mean longer CI runs and a heavier review burden. We:
- Set PR size limits (max 400 lines changed)
- Encouraged feature flags for incremental delivery
- Provided templates for breaking large changes into smaller PRs
Smaller PRs meant faster CI feedback (2-4 minutes vs. 6+ minutes) and faster code reviews.
Enforcing Code-Review SLAs to Reduce Queueing
We implemented review SLAs:
- First review within 4 hours during business hours
- Auto-assign reviewers based on file paths
- Reminder bots for stale PRs
This reduced PR queue time and prevented CI resources from being tied up by unreviewed PRs.
Educating the Team on Writing Efficient Tests
We conducted workshops on:
- Writing fast, isolated unit tests
- Avoiding slow I/O operations in tests
- Using mocks and stubs effectively
- Test organization and naming conventions
This cultural change led to developers writing faster tests from the start, preventing future CI slowdowns.
Documenting Best Practices for Long-Term Consistency
We created comprehensive documentation:
- CI/CD Playbook: How to add new jobs, configure caching, etc.
- Test Guidelines: When to write unit vs. integration vs. E2E tests
- Performance Budgets: Maximum allowed times for each job type
- Onboarding Guide: How new developers can contribute without breaking CI
This ensured the improvements were sustainable and new team members followed best practices.
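One simple way to back a performance budget with tooling is GitHub Actions' `timeout-minutes`, which fails any job that exceeds its ceiling; a sketch (the budget values are illustrative):
# Sketch: hard ceilings that fail a job when its budget is blown
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v3
      - run: npm test
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v3
      - run: npm run build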
Key Takeaways
Optimizing CI/CD isn't about a single silver bullet - it's about systematic improvements across multiple dimensions:
- Measure first: You can't optimize what you don't measure. Profile every stage.
- Cache aggressively: npm, Docker layers, and build artifacts are your friends.
- Parallelize everything: Tests, builds, and jobs should run concurrently.
- Be selective: Don't run everything on every commit. Use path-based triggers.
- Invest in infrastructure: Self-hosted runners with proper hardware pay for themselves.
- Eliminate waste: Remove flaky tests, unused dependencies, and redundant jobs.
- Monitor continuously: Set up dashboards and alerts to catch regressions.
- Build culture: Educate the team and document best practices.
Conclusion
What started as a 45-minute pipeline blocking deployments became a 6-minute pipeline that enables rapid iteration. The transformation required systematic optimization across profiling, caching, parallelization, dependency management, test optimization, intelligent triggering, architecture redesign, and cultural improvements. For more cost optimization strategies, see our case studies on reducing infrastructure spend.
For Node.js applications specifically, the biggest wins came from:
- npm caching and dependency optimization (saved 5 minutes)
- Docker layer caching and optimized base images (saved 9 minutes)
- Test parallelization and path-based execution (saved 11 minutes)
- Self-hosted runners and hardware improvements (saved 4 minutes)
- Intelligent job triggering (saved 10 minutes by skipping unnecessary runs)
The result? A team that ships code 7x faster, spends 60% less on CI, and has zero flaky tests blocking deployments. This is the power of systematic CI/CD optimization.