CI/CD Troubleshooting Guide

Document Version: 1.0
Last Updated: 2025-01-01
Audience: Developers, Build Engineers, Maintainers

Overview

This guide covers common CI/CD issues in the Crankshaft project and their solutions. Issues are organized by workflow and include diagnosis steps, root causes, and fixes.

Top Issues Quick Reference

#	Issue	Symptom	Severity	Time to Fix
1	Quality gate failure	PR shows "Quality Gate: ✗"	Medium	5-15 min
2	Build timeout	Build job exceeds 45 min	High	10-30 min
3	Artifact not found	"Artifact not found" error	High	5-10 min
4	APT publish fails	"Repository verification failed"	High	15-30 min
5	GPG signing error	"GPG signature failed"	Critical	10-20 min
6	Release creation fails	"Release job failed"	Medium	10-20 min
7	Pi-Gen image won't boot	Image boots to black screen	High	30-60 min
8	Secret not available	"Secret X not found"	Critical	5-10 min
9	Workflow concurrency lock	"Workflow queued" indefinitely	Medium	5-15 min
10	Out of disk space	Build fails with "No space left"	Critical	10-20 min

Issue #1: Quality Gate Failure

Symptom

PR shows status: "Quality Gate: ✗"
PR comment contains code violations
Merge blocked until resolved

Root Causes

Code style violations (spacing, naming)
Potential memory leaks (detected by clang-tidy)
Security issues (detected by CodeQL)
Code duplication above threshold

Diagnosis Steps

Check PR comment
- Quality workflow posts detailed violation list
- Click "Show more" to expand violation details

Review violations by category

Category examples:
- clang-tidy: Modernisation, Performance, Readability
- cppcheck: Memory leak, null pointer, bounds check
- CodeQL: SQL injection, buffer overflow, etc.

Identify which check failed
- Each check shown separately in comment
- Look for "❌" mark to find failing checks

Solutions

For C++ code violations:

View detailed violation message:

# Click violation in PR comment
# Shows file:line:col and explanation

Fix locally:

# Format code to match style
cd crankshaft-mvp
./scripts/format_cpp.sh fix

# Run clang-tidy to find issues
./scripts/lint_cpp.sh clang-tidy

# Fix violations based on output

Verify fix:

# Push changes
git add .
git commit -m "Fix: Quality gate violations - #123"
git push origin feature/my-feature

# Wait 5 minutes for quality workflow
# Should show "Quality Gate: ✓"

Common violations and fixes:

Violation	Cause	Fix
`modernize-use-nullptr`	Using NULL instead of nullptr	Replace `NULL` with `nullptr`
`readability-named-parameter`	Unnamed parameter in a function definition	Name the parameter, or keep the name in a comment such as `/*unused*/`
`performance-unnecessary-copy-initialization`	Unnecessary copy	Use reference or move semantics
`misc-unused-parameters`	Function parameter not used	Remove it or mark it `[[maybe_unused]]`

Prevention

Run quality checks locally before pushing:

./scripts/format_cpp.sh check
./scripts/lint_cpp.sh clang-tidy

Fix violations before push:
```
./scripts/format_cpp.sh fix
```
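
The two local checks above could be wrapped in one function and called from a Git pre-push hook. This is a minimal sketch, not part of the project: the function name `run_quality_checks` and its optional directory argument are assumptions, and only the two script invocations come from this guide.

```shell
#!/bin/sh
# Sketch of a combined local quality check (hypothetical helper).
# The optional argument is the directory holding the scripts; default ./scripts.
run_quality_checks() (
  scripts_dir="${1:-./scripts}"
  status=0
  # Run both checks even if the first fails, so every problem is reported.
  "$scripts_dir/format_cpp.sh" check || status=1
  "$scripts_dir/lint_cpp.sh" clang-tidy || status=1
  exit "$status"
)

# Example use inside .git/hooks/pre-push:
#   run_quality_checks ./scripts || { echo "Quality checks failed" >&2; exit 1; }
```

The function returns non-zero if either script fails, which is what a pre-push hook needs in order to block the push.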

Issue #2: Build Timeout

Symptom

Build job shows "Workflow execution timed out"
Log shows incomplete build (stopped mid-compile)
Status: "Build: ✗"

Root Causes

Too many parallel jobs (building all architectures)
Slow Docker image build
Dependency compilation taking too long
Insufficient build cache
Large source code changes

Diagnosis Steps

Check job duration:
- Go to Actions → Platform Builds → [Run]
- Look at "Build job [architecture]" duration
- Timeout threshold: 45 minutes

Check build log for bottleneck:

# In GitHub Actions UI
Click job → Scroll to find slowest step
Look for step taking >10 minutes

Estimate remaining time:
- If 90% done at timeout: likely to succeed next attempt
- If 50% done: need optimization
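
To put a number on how far a timed-out build got, the last progress marker in a saved job log can be converted to a percentage. The sketch below assumes Ninja-style markers such as `[123/456]` appear in the log; that is an assumption about the build output, and `build_progress_percent` is a hypothetical helper name.

```shell
# Print how far a build got, as a whole percentage, from a saved job log.
# Assumes Ninja-style progress markers such as "[123/456]" in the log;
# Makefile generators print "[ 45%]" instead and would need another pattern.
build_progress_percent() {
  grep -oE '\[[0-9]+/[0-9]+\]' "$1" | tail -n 1 | sed 's/[][]//g' |
    awk -F/ '$2 > 0 { printf "%d\n", ($1 * 100) / $2 }'
}

# Example: gh run view <run-id> --log > run.log
#          build_progress_percent run.log
```

A value near 90 suggests a retry is worthwhile; a value near 50 points to the optimizations described under Solutions.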

Solutions

Immediate fix (retry):

# Via GitHub CLI
gh run rerun {run_id} --failed

# Via UI
Actions → [Run] → Re-run failed jobs

For repeated timeouts:

Reduce build scope for feature branches:
- Feature branches build amd64 only (default)
- Main branch builds all architectures
- Check: Are you on a feature branch?

Clear Docker cache (if Docker build slow):

# Via GitHub CLI: find the most recent successful run on main
gh run list --branch main --status success --limit 1 \
  --json databaseId,number --jq '.[0]'

# Then use that run's cache

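Check for large files committed by accident:

# Look for binary files in the last commit
git show HEAD --stat

# If one was added accidentally, undo the commit but keep the changes staged
git reset --soft HEAD~1
# Unstage the large file, then commit again without it
git reset HEAD <large-file>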

Profile build locally:

# Time the build on a local machine
# (on Windows, run `wsl` first to enter the Linux environment)
time ./scripts/build.sh --build-type Debug

# If it takes >30 min locally, optimize the build before pushing

Prevention

Keep feature branches focused (small changes)
Commit only necessary files (avoid binaries)
Test builds locally before pushing large changes
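
One way to catch stray binaries before committing is to scan the working tree for oversized files. This is a rough sketch: `find_large_files` is a hypothetical helper, and the 5120 KiB default is an illustrative threshold rather than a project rule.

```shell
# List files under a directory that exceed a size limit in KiB.
# The 5120 KiB default is an illustrative threshold, not a project rule.
find_large_files() (
  dir="${1:-.}"
  limit_kib="${2:-5120}"
  # Skip Git's own object store; report every other file above the limit.
  find "$dir" -path '*/.git' -prune -o -type f -size +"${limit_kib}"k -print
)

# Example: find_large_files . 5120
```

Any path it prints is a candidate for removal from the commit or for an entry in `.gitignore`.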

Issue #3: Artifact Not Found

Symptom

Downstream workflow (APT, Release) fails
Error: "Artifact 'build-artifacts-amd64' not found"
Status: "Build: ✗" or "APT Publish: ✗"

Root Causes

Build workflow failed (didn't produce artifacts)
Artifact retention period expired (30 days)
Wrong artifact name specified
Wrong branch trigger
Concurrency cancelled previous build

Diagnosis Steps

Check build workflow status:
- Go to Actions → Platform Builds
- Find the build that should produce artifacts
- Status should show "✓" (success)

Check artifact age:
- In build run, go to "Artifacts" tab
- Note creation timestamp
- Default retention: 30 days

Check artifact name:
- In build run, verify artifact exists
- Compare name with downstream workflow (apt.yml, release.yml)

Solutions

If build failed:

Review build log for compilation errors
Fix errors locally
Push fix to same branch
Build runs again automatically

If artifact expired:

Rebuild on target branch:

# Manual trigger
Actions → Platform Builds → Run workflow
Select branch and architectures

Extend retention period:

# In .github/workflows/build.yml
- uses: actions/upload-artifact@v4
  with:
    retention-days: 60  # Extend from 30 to 60

If wrong artifact name:

Verify artifact name in build workflow
Check downstream workflow references same name
Fix mismatch in YAML
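
The name comparison can be partly automated by extracting artifact names from two workflow files. The sketch below is deliberately simple and makes an assumption about YAML layout: it expects `name:` to appear under `with:` after the `uses:` line of an upload or download step. Both function names are hypothetical.

```shell
# Print artifact names passed to actions/upload-artifact or
# actions/download-artifact in a workflow file, one per line, sorted.
# Simplification: expects "name:" under "with:" after the "uses:" line.
artifact_names() {
  awk '
    /uses:[ \t]*actions\/(upload|download)-artifact/ { in_step = 1; next }
    in_step && /^[ \t]*-[ \t]/ { in_step = 0 }   # a new step has started
    in_step && /^[ \t]*name:/ {
      sub(/^[ \t]*name:[ \t]*/, "")
      print
      in_step = 0
    }
  ' "$1" | sort -u
}

# Succeed when both workflow files reference the same set of artifact names.
artifact_names_match() {
  [ "$(artifact_names "$1")" = "$(artifact_names "$2")" ]
}

# Example: artifact_names_match .github/workflows/build.yml .github/workflows/release.yml
```

Names built from expressions such as matrix variables are compared as literal text, so a mismatch reported here still needs a manual look.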

Prevention

Monitor artifact retention dates
Don't rely on artifacts older than 2 weeks
For long-term storage, use GitHub Releases

Issue #4: APT Publish Fails

Symptom

APT workflow fails at "Publish to repository" step
Error: "Repository verification failed"
Error: "GPG signature invalid"
Status: "APT Publish: ✗"

Root Causes

GPG key issue (expired, missing passphrase)
Repository metadata corrupted
Duplicate package in repository
APT server connection failed
Insufficient disk space on APT server

Diagnosis Steps

Check GPG key status:

# Verify GPG secret exists
gh secret list --repo opencardev/crankshaft | grep GPG

# Should show: GPG_SIGNING_KEY, GPG_KEY_PASSPHRASE

Check APT repository:

# Test APT update
sudo apt-get update

# If this fails, the repository metadata may be corrupted

Check for duplicate packages:

# List published packages
apt-cache search crankshaft

# Should show each package once per version

Solutions

GPG key issues:

Verify key is valid:

# List secret keys
gpg --list-secret-keys

# Check expiration (field 7 of the pub line is the expiry time)
gpg --list-keys --with-colons <key-id>

Update expired key:

# Extend key expiration
gpg --edit-key <key-id>
# Type: expire
# Select: 2y (extend 2 years)
# Type: save

# Export the updated key
gpg --armor --export-secret-keys <key-id> > exported-key.txt

# Update secret in GitHub
gh secret set GPG_SIGNING_KEY < exported-key.txt

Verify passphrase:

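# Test that the passphrase works
# (GnuPG 2.1+ only honours --passphrase with --batch and loopback pinentry)
echo "test message" | gpg --batch --pinentry-mode loopback \
    --passphrase "$PASS" --armor --detach-sign > /dev/null

# If this fails, update the GPG_KEY_PASSPHRASE secret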

Repository issues:

Rebuild repository from scratch:

# Via manual workflow
Actions → APT Repository → Run workflow

# Removes corrupted metadata and rebuilds

Remove duplicate packages:

# Find duplicate
apt-cache policy crankshaft-ui

# Remove older version
# (Contact APT maintainer)

Prevention

Monitor GPG key expiration (check quarterly)
Test APT updates on regular basis
Use apt-get clean after testing

Issue #5: GPG Signing Error

Symptom

Workflow fails at signing step
Error: "gpg: error while signing: unknown error"
Error: "gpg: failed to sign"
Status: "APT Publish: ✗" or "Release: ✗"

Root Causes

GPG key expired
GPG key passphrase incorrect
GPG binary not installed
Insufficient entropy (on headless system)
GPG agent timeout

Diagnosis Steps

Check GPG installation:

gpg --version

# Should output version and build info

Check key status:

gpg --list-secret-keys

# Should show key with [SCEA] flags
# Check "sec  rsa4096 ... [expired: ...]"

Test signing locally:

echo "test" | gpg --local-user <key-id> --armor --detach-sign

# If this fails, a key issue is confirmed

Solutions

Check passphrase:

# Verify passphrase works
echo "test message" | \
  gpg --batch --no-tty --pinentry-mode loopback \
      --passphrase "$GPG_KEY_PASSPHRASE" \
      --armor --detach-sign > /dev/null

# If this fails, the passphrase is incorrect; update the secret
# (run without a value to be prompted, which keeps it out of shell history)
gh secret set GPG_KEY_PASSPHRASE

Renew expired key:

# Get the primary key fingerprint (--quick-set-expire requires a fingerprint)
KEY_FPR=$(gpg --list-secret-keys --with-colons | \
          awk -F: '$1 == "fpr" { print $10; exit }')

# Extend expiration by 2 years
gpg --batch --no-tty --quick-set-expire "$KEY_FPR" 2y

# Export and update secret
gpg --armor --export-secret-keys "$KEY_FPR" | \
    gh secret set GPG_SIGNING_KEY

For headless systems (Linux, CI):

# Feed the kernel entropy pool (only if key operations stall; needs rng-tools)
sudo rngd -r /dev/urandom

# If signing instead hangs waiting for a passphrase prompt,
# allow non-interactive (loopback) passphrase entry
echo "allow-loopback-pinentry" >> ~/.gnupg/gpg-agent.conf
gpgconf --kill gpg-agent

Prevention

Monitor key expiration dates
Refresh keys quarterly
Test signing in release preview phase
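
Expiry monitoring can be scripted against GnuPG's machine-readable listing. The sketch below assumes GnuPG 2.x `--with-colons` output, where field 5 of a `pub` or `sec` record is the key ID and field 7 is the expiry time in epoch seconds. The helper name `key_days_remaining` is hypothetical.

```shell
# Read `gpg --list-keys --with-colons` output on stdin and print, for each
# primary key that has an expiry date, the key ID and whole days remaining.
# The first argument is the current time in epoch seconds.
key_days_remaining() {
  awk -F: -v now="$1" '
    ($1 == "pub" || $1 == "sec") && $7 != "" {
      printf "%s %d\n", $5, ($7 - now) / 86400
    }'
}

# Example: gpg --list-keys --with-colons | key_days_remaining "$(date +%s)"
```

Keys without an expiry date are skipped, and a negative number of days means the key has already expired.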

Issue #6: Release Creation Fails

Symptom

Release workflow fails
Error: "Release creation failed"
Error: "Tag not found"
Status: "Release: ✗"

Root Causes

Build artifacts missing (Issue #3)
Tag doesn't exist or not pushed
Release already exists
GitHub API rate limit exceeded
Insufficient permissions

Diagnosis Steps

Check if tag exists:

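# Check locally
git tag --list v1.2.3

# Check on the remote
git ls-remote --tags origin v1.2.3

# If the tag exists locally but not on the remote, push it
git push origin v1.2.3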

Check if release exists:

gh release view v1.2.3

# If shows "release not found", proceed with creation
# If shows details, release already exists

Check build artifacts:

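# Find the most recent successful run on main
gh run list --branch main --status success --limit 1

# List that run's artifacts (gh run view has no artifacts field; use the REST API)
gh api repos/opencardev/crankshaft/actions/runs/<build-run-id>/artifacts \
  --jq '.artifacts[].name'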

Solutions

If tag missing:

# Create and push tag
git tag v1.2.3
git push origin v1.2.3

# Workflow automatically triggers, creates release

If release already exists (updating):

# Delete old release and tag (remote and local)
gh release delete v1.2.3 --yes
git push origin :refs/tags/v1.2.3
git tag -d v1.2.3

# Recreate
git tag v1.2.3
git push origin v1.2.3

If using manual release mode:

# Go to Actions → Release → Run workflow
# Enter build-run-id from successful build
# Leave tag field empty

# Workflow uses existing artifacts without rebuilding

Prevention

Tag production releases with semantic versioning (v1.2.3)
Use git push origin --tags to ensure tags pushed
Verify release in GitHub UI before announcing
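
A tag name can be checked against the `v1.2.3` pattern before it is pushed. This is a small sketch with a hypothetical helper name; it accepts only the plain three-part form, so pre-release suffixes would need a looser pattern.

```shell
# Succeed only for tags of the form vMAJOR.MINOR.PATCH, e.g. v1.2.3.
# Pre-release suffixes such as v1.2.3-rc1 are rejected in this sketch.
is_release_tag() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Example: is_release_tag "$TAG" && git tag "$TAG" && git push origin "$TAG"
```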

Issue #7: Pi-Gen Image Won't Boot

Symptom

Image written to SD card
Raspberry Pi 4 boots, shows black screen
No SSH access, no console output
Status: "Pi-Gen Images: ✓" (workflow succeeded)

Root Causes

Wrong APT channel used (nightly vs stable)
Kernel version incompatible with RPi4
Device tree blob missing or corrupt
Root filesystem not extracted properly
SD card not fully written

Diagnosis Steps

Check which image used:

# Look at Pi-Gen workflow run
Actions → Pi-Gen Images → [Run]

Check input: apt_channel (stable/nightly)
Check input: image_types (lite/full)

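Verify SD card write (rewrite if in doubt):

# On Linux/WSL
lsblk  # List drives and identify the SD card
# Replace /dev/sdX with the SD card device; dd overwrites it without asking
sudo dd if=image.img of=/dev/sdX bs=4M status=progress conv=fsync
sync   # Ensure all data is written before removing the card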

Check for errors during build:

# In workflow logs
Look for "ERROR" or "FAILED" messages
Check "Boot verification" step

Solutions

Use stable APT channel:

# Retry Pi-Gen build
Actions → Pi-Gen Images → Run workflow
Select apt_channel: stable (not nightly)

# Wait for completion, rewrite SD card

Verify image file integrity:

# Check SHA256 (from workflow artifacts)
sha256sum image.img
# Compare with artifacts/image.img.sha256

# If mismatch: download again and retry
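
The manual comparison above can be done in one step. The sketch assumes GNU coreutils `sha256sum` and a checksum file in the usual `<digest>  <name>` format; `verify_sha256` is a hypothetical helper name.

```shell
# Compare a file's SHA-256 digest with the first field of a checksum file.
# Assumes GNU coreutils `sha256sum` and the usual "<digest>  <name>" format.
verify_sha256() (
  expected=$(awk '{ print $1; exit }' "$2")
  actual=$(sha256sum "$1" | awk '{ print $1 }')
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
)

# Example: verify_sha256 image.img image.img.sha256 || echo "Checksum mismatch" >&2
```

Comparing only the digest field avoids failures caused by the checksum file recording a different path for the image.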

Test with lite image first:

# Lite image more reliable than full
# Download lite image from workflow artifacts
# Write to SD card
# Boot and test connectivity:
ssh -i /path/to/key pi@<pi-ip>

Enable HDMI debugging:

# Add to config.txt on the SD card boot partition
hdmi_force_hotplug=1
hdmi_drive=2

Prevention

Always test on Raspberry Pi 4 (same hardware)
Use stable APT channel for production
Keep SD card with known good image for testing
Monitor Pi-Gen build logs for warnings

Issue #8: Secret Not Available

Symptom

Workflow fails accessing secret
Error: "Secret 'X' not found"
Error: "Null or empty value"
Status: "Job failed"

Root Causes

Secret not created in repository
Secret not inherited from organization
Workflow using wrong secret name
Secret deleted or renamed
Access token expired

Diagnosis Steps

Check if secret exists:

gh secret list --repo opencardev/crankshaft

# Compare with secret name in workflow YAML
# Check spelling and case

Check secret scope:

# Repository secrets
gh secret list --repo opencardev/crankshaft

# Organization secrets
gh secret list --org opencardev

Check workflow variable reference:

# Correct syntax
- name: Use secret
  run: echo ${{ secrets.MY_SECRET }}

# Incorrect (missing 'secrets.' prefix)
run: echo ${{ MY_SECRET }}

Solutions

Create missing secret:

# Option 1: GitHub CLI
gh secret set MY_SECRET < <(echo "secret-value")

# Option 2: GitHub UI
Settings → Secrets and variables → Actions → New repository secret

# Then reference in workflow
env:
  MY_SECRET: ${{ secrets.MY_SECRET }}

Check secret value:

# Can't view secret value, but can verify creation
gh secret list

# Should show "MY_SECRET" in list

Fix workflow reference:

# Wrong
- run: echo ${{ MY_SECRET }}

# Correct
- run: echo ${{ secrets.MY_SECRET }}

# Or pass it through the step's env
- run: echo "$SOME_VAR"
  env:
    SOME_VAR: ${{ secrets.MY_SECRET }}

Prevention

Document all required secrets in README
Validate secret names match workflow references
Periodically audit secrets (remove unused)
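
Matching secret names against workflow references can be scripted. The sketch below extracts every `secrets.NAME` reference from a workflow file and prints those missing from a list of known names, one per line. How that list is produced (for example, copied from `gh secret list`) is left open, and `missing_secrets` is a hypothetical helper name.

```shell
# Print secret names referenced in a workflow file that are absent from a
# list of known names (a plain text file with one name per line).
missing_secrets() (
  referenced=$(grep -oE 'secrets\.[A-Za-z_][A-Za-z0-9_]*' "$1" |
    sed 's/^secrets\.//' | sort -u)
  for name in $referenced; do
    # -x matches whole lines, -F treats the name as a fixed string.
    grep -qxF "$name" "$2" || printf '%s\n' "$name"
  done
)

# Example: missing_secrets .github/workflows/apt.yml known-secrets.txt
```

`GITHUB_TOKEN` is provided automatically by GitHub Actions, so it will be reported unless it is added to the list of known names.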

Issue #9: Workflow Concurrency Lock

Symptom

Workflow appears in "Queued" state indefinitely
Status shows yellow dot (pending)
Not progressing even after 30+ minutes
Other workflows on same concurrency group are running

Root Causes

Previous workflow not cancelled properly
Concurrency group limits too restrictive
Zombie workflow stuck in running state
Race condition in concurrency key

Diagnosis Steps

Check concurrency group:

# View workflow
Actions → [Workflow] → [Run]

Look for "Concurrency group: build-main" (example)

Check other runs in group:

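# Via GitHub CLI (gh does not print the concurrency group,
# so filter by the workflow and branch that share it)
gh run list --workflow build.yml --branch main --status in_progress

# Should show only one run at a time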

Check run duration:

# If previous run >2 hours, likely stuck
gh run view <run-id> --json status,startedAt,updatedAt

Solutions

Cancel stuck workflow:

# Via GitHub CLI
gh run cancel <stuck-run-id>

# Queued workflow should start immediately

Monitor concurrency:

# Check if one branch blocks others
Actions → [Workflow] → [Recent runs]

# If many "Queued" with one "In progress"
# Concurrency too restrictive, consider adjusting

Increase parallel jobs:

# In workflow YAML, if appropriate
concurrency:
  group: build-${{ github.ref }}
  cancel-in-progress: true

# And increase max runners
jobs:
  build-amd64:
    runs-on: [self-hosted, amd64]
  build-arm64:
    runs-on: [self-hosted, arm64]

Prevention

Monitor workflow queue regularly
Set reasonable timeouts (45 min for builds)
Use cancel-in-progress: true to clean up old runs

Issue #10: Out of Disk Space

Symptom

Build fails with "No space left on device"
Docker build fails: "Write failed"
Artifact upload fails
Status: "Build: ✗"

Root Causes

Docker images accumulating (no cleanup)
Build artifacts from previous runs
Temporary build files not cleaned
Insufficient disk on runner

Diagnosis Steps

Check disk usage (on build runner):

df -h

# Check if root (/) or /var is full
du -sh /*

Check Docker usage:

docker system df

# Should show images, containers, volumes

Check temporary files:

du -sh /tmp /var/tmp ~/.cache

# Large directories indicate cleanup needed

Solutions

Clean Docker images:

# Remove dangling images
docker image prune -f

# Remove all unused images older than 72 hours
docker image prune -a -f --filter "until=72h"

Clean build artifacts:

# Remove stale work directories left by previous runs
cd ~/work

# Find and remove top-level directories not modified for >7 days
find . -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +

Extend disk space:

# For GitHub-hosted runners: each job starts on a fresh VM, so nothing accumulates between runs

# For self-hosted runners:
# 1. Add disk space to system
# 2. Or add cache cleanup job to workflow

# In workflow YAML
- name: Clean up Docker
  if: always()
  run: |
    docker system prune -f --all
    docker volume prune -f

Add cleanup step to workflow:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # ... build steps ...
      
      - name: Cleanup
        if: always()
        run: |
          du -sh ~/.cache
          rm -rf ~/.cache/pip
          docker system prune -f

Prevention

Add cleanup steps to all long-running workflows
Monitor disk usage weekly on self-hosted runners
Set GitHub Actions artifact retention to 7-14 days
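
Disk monitoring on a self-hosted runner can start with a check that fails once usage crosses a limit. This is a minimal sketch: both function names are hypothetical, and the 85% default is an illustrative threshold, not a project setting.

```shell
# Print the percentage of used space on the filesystem holding a path.
disk_used_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is below the limit; 85 is an illustrative default.
disk_has_headroom() (
  used=$(disk_used_percent "${1:-/}")
  [ "$used" -lt "${2:-85}" ]
)

# Example: disk_has_headroom / 85 || docker system prune -f
```

Run as the first step of a job, it lets cleanup happen before the build starts instead of after it fails.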

Getting Help

Information to Include When Reporting Issues

Workflow name and run ID

gh run list --repo opencardev/crankshaft | head -5

Full workflow log
```
gh run view <run-id> --log > run.log
```

System information

uname -a
docker --version
cmake --version

Recent commits
```
git log --oneline -5
```
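
The local details listed above can be gathered in one pass. The function below is a sketch with a hypothetical name; it covers only the system information and recent commits, since the run ID and workflow log still come from the `gh` commands shown earlier.

```shell
# Gather system details and recent commits into one report on stdout.
# Tools that are not installed are reported as missing rather than failing.
collect_report() (
  echo "== System =="
  uname -a
  for tool in docker cmake git; do
    echo "== $tool =="
    if command -v "$tool" >/dev/null 2>&1; then
      "$tool" --version
    else
      echo "$tool: not installed"
    fi
  done
  # Only list commits when run inside a Git working tree.
  if git rev-parse --git-dir >/dev/null 2>&1; then
    echo "== Recent commits =="
    git log --oneline -5 || true
  fi
)

# Example: collect_report > report.txt
```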

Resources

Workflow guide: docs/ci-cd/workflow-guide.md
Architecture decisions: docs/ci-cd/architecture-decisions.md
Developer handbook: docs/ci-cd/developer-handbook.md
GitHub Actions docs: https://docs.github.com/en/actions
Community issues: https://github.com/opencardev/crankshaft/issues

Common Support Channels

GitHub Issues: Report bugs, request features
GitHub Discussions: Ask questions, share knowledge
Slack/Discord: Real-time chat (if enabled)
Email: Submit detailed bug reports

Document History

Version	Date	Changes
1.0	2025-01-01	Initial version with top 10 issues
1.1	2025-01-XX	Added GPU/ARM runner specifics (pending)
