Document Version: 1.0
Last Updated: 2025-01-01
Audience: Developers, Build Engineers, Maintainers
This guide covers common CI/CD issues in the Crankshaft project and their solutions. Issues are organized by workflow and include diagnosis steps, root causes, and fixes.
| # | Issue | Symptom | Severity | Time to Fix |
|---|---|---|---|---|
| 1 | Quality gate failure | PR shows "Quality Gate: ✗" | Medium | 5-15 min |
| 2 | Build timeout | Build job exceeds 45 min | High | 10-30 min |
| 3 | Artifact not found | "Artifact not found" error | High | 5-10 min |
| 4 | APT publish fails | "Repository verification failed" | High | 15-30 min |
| 5 | GPG signing error | "GPG signature failed" | Critical | 10-20 min |
| 6 | Release creation fails | "Release job failed" | Medium | 10-20 min |
| 7 | Pi-Gen image won't boot | Image boots to black screen | High | 30-60 min |
| 8 | Secret not available | "Secret X not found" | Critical | 5-10 min |
| 9 | Workflow concurrency lock | "Workflow queued" indefinitely | Medium | 5-15 min |
| 10 | Out of disk space | Build fails with "No space left" | Critical | 10-20 min |
- PR shows status: "Quality Gate: ✗"
- PR comment contains code violations
- Merge blocked until resolved
- Code style violations (spacing, naming)
- Potential memory leaks (detected by clang-tidy)
- Security issues (detected by CodeQL)
- Code duplication above threshold
- Check the PR comment
  - The quality workflow posts a detailed violation list
  - Click "Show more" to expand violation details
- Review violations by category. Category examples:
  - clang-tidy: Modernisation, Performance, Readability
  - cppcheck: Memory leak, null pointer, bounds check
  - CodeQL: SQL injection, buffer overflow, etc.
- Identify which check failed
  - Each check is shown separately in the comment
  - Look for the "❌" mark to find failing checks
For C++ code violations:
- View the detailed violation message:
    # Click the violation in the PR comment
    # It shows file:line:col and an explanation
- Fix locally:
    # Format code to match style
    cd crankshaft-mvp
    ./scripts/format_cpp.sh fix
    # Run clang-tidy to find issues
    ./scripts/lint_cpp.sh clang-tidy
    # Fix violations based on output
- Verify fix:
    # Push changes
    git add .
    git commit -m "Fix: Quality gate violations - #123"
    git push origin feature/my-feature
    # Wait 5 minutes for quality workflow
    # Should show "Quality Gate: ✓"
Common violations and fixes:
| Violation | Cause | Fix |
|---|---|---|
| modernize-use-nullptr | Using NULL instead of nullptr | Replace NULL with nullptr |
| readability-named-parameter | Parameter left unnamed in a function definition | Name it, or mark it with a /*unused*/ comment |
| performance-unnecessary-copy-initialization | Unnecessary copy | Use a const reference or move semantics |
| misc-unused-parameters | Function parameter not used | Remove or mark with [[maybe_unused]] |
- Run quality checks locally before pushing:
    ./scripts/format_cpp.sh check
    ./scripts/lint_cpp.sh clang-tidy
- Fix violations before push:
    ./scripts/format_cpp.sh fix
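The local checks above can be chained so that a push stops at the first sign of trouble, for example from a Git pre-push hook. The following is a minimal sketch, assuming the script paths shown in this guide; the function name `run_checks` is hypothetical and not part of the project.

```shell
# Hypothetical helper: run each check command in order and report failures.
# Every argument is one command line, e.g. "./scripts/format_cpp.sh check".
run_checks() {
    local failed=0 cmd
    for cmd in "$@"; do
        echo "Running: $cmd"
        # Run in a subshell so a check cannot change the caller's state.
        if ! ( eval "$cmd" ); then
            echo "FAILED: $cmd" >&2
            failed=1
        fi
    done
    # Non-zero if at least one check failed.
    return "$failed"
}
```

A hook could then call `run_checks "./scripts/format_cpp.sh check" "./scripts/lint_cpp.sh clang-tidy" || exit 1`. All checks run even after a failure, so one pass lists every failing check.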
- Build job shows "Workflow execution timed out"
- Log shows incomplete build (stopped mid-compile)
- Status: "Build: ✗"
- Parallel jobs too numerous (building all architectures)
- Slow Docker image build
- Dependency compilation taking too long
- Insufficient build cache
- Large source code changes
- Check job duration:
  - Go to Actions → Platform Builds → [Run]
  - Look at "Build job [architecture]" duration
  - Timeout threshold: 45 minutes
- Check build log for bottleneck:
  - In the GitHub Actions UI, click the job and scroll to find the slowest step
  - Look for a step taking >10 minutes
- Estimate remaining time:
  - If 90% done at timeout: likely to succeed next attempt
  - If 50% done: need optimization
Immediate fix (retry):
# Via GitHub CLI
gh run rerun {run_id} --failed
# Via UI
Actions → [Run] → Re-run failed jobs
For repeated timeouts:
- Reduce build scope for feature branches:
  - Feature branches build amd64 only (default)
  - Main branch builds all architectures
  - Check: Are you on a feature branch?
- Clear Docker cache (if Docker build slow):
    # Via GitHub CLI: find the ID of the last successful run on main
    gh run list --branch main --status success --limit 1 \
      --json databaseId --jq '.[0].databaseId'
    # Then use that run's cache
- Check for large files committed by accident:
    # Look for binary files in the latest commit
    git show HEAD --stat
    # Remove if added accidentally: undo the commit, then unstage the file
    git reset --soft HEAD~1
    git reset HEAD <large-file>
- Profile build locally:
    # Time build on local machine (the wsl prefix applies when launching from Windows)
    wsl time ./scripts/build.sh --build-type Debug
    # If >30 min locally, optimize code before pushing
- Keep feature branches focused (small changes)
- Commit only necessary files (avoid binaries)
- Test builds locally before pushing large changes
- Downstream workflow (APT, Release) fails
- Error: "Artifact 'build-artifacts-amd64' not found"
- Status: "Build: ✗" or "APT Publish: ✗"
- Build workflow failed (didn't produce artifacts)
- Artifact retention period expired (30 days)
- Wrong artifact name specified
- Wrong branch trigger
- Concurrency cancelled previous build
- Check build workflow status:
  - Go to Actions → Platform Builds
  - Find the build that should produce artifacts
  - Status should show "✓" (success)
- Check artifact age:
  - In the build run, go to the "Artifacts" tab
  - Note the creation timestamp
  - Default retention: 30 days
- Check artifact name:
  - In the build run, verify the artifact exists
  - Compare its name with the downstream workflow (apt.yml, release.yml)
If build failed:
- Review build log for compilation errors
- Fix errors locally
- Push fix to same branch
- Build runs again automatically
If artifact expired:
- Rebuild on target branch:
    # Manual trigger
    Actions → Platform Builds → Run workflow
    Select branch and architectures
- Extend retention period:
    # In .github/workflows/build.yml
    - uses: actions/upload-artifact@v4
      with:
        retention-days: 60  # Extend from 30 to 60
If wrong artifact name:
- Verify artifact name in build workflow
- Check downstream workflow references same name
- Fix mismatch in YAML
- Monitor artifact retention dates
- Don't rely on artifacts older than 2 weeks
- For long-term storage, use GitHub Releases
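Monitoring retention dates comes down to simple date arithmetic on the artifact's creation timestamp. Here is a minimal sketch, assuming GNU `date` (for `-d`) and the 30-day retention stated above; the function name `artifact_days_left` is hypothetical.

```shell
# Hypothetical helper: whole days left before an artifact expires.
# $1 = creation timestamp (ISO 8601), $2 = retention in days,
# $3 = "now" as ISO 8601 (optional; defaults to the current time).
# Assumes GNU date, which understands -d.
artifact_days_left() {
    local created_s now_s
    created_s=$(date -u -d "$1" +%s) || return 1
    now_s=$(date -u -d "${3:-now}" +%s) || return 1
    # Expiry time minus now, truncated toward zero; negative means expired.
    echo $(( ( created_s + $2 * 86400 - now_s ) / 86400 ))
}
```

For example, `artifact_days_left "2025-01-01T00:00:00Z" 30` prints the days remaining for an artifact created on that date. The creation timestamp would come from the "Artifacts" tab of the build run.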
- APT workflow fails at "Publish to repository" step
- Error: "Repository verification failed"
- Error: "GPG signature invalid"
- Status: "APT Publish: ✗"
- GPG key issue (expired, missing passphrase)
- Repository metadata corrupted
- Duplicate package in repository
- APT server connection failed
- Insufficient disk space on APT server
- Check GPG key status:
    # Verify GPG secrets exist
    gh secret list --repo opencardev/crankshaft | grep GPG
    # Should show: GPG_SIGNING_KEY, GPG_KEY_PASSPHRASE
- Check APT repository:
    # Test APT update
    sudo apt-get update
    # If it fails, the repository metadata may be corrupted
- Check for duplicate packages:
    # List published packages
    apt-cache search crankshaft
    # Should show each package once per version
GPG key issues:
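- Verify key is valid:
    # List the secret keys held locally
    gpg --list-secret-keys
    # Check expiration (field 7 of the "pub" record)
    gpg --list-keys --with-colons <key-id>
- Update expired key:
    # Extend key expiration
    gpg --edit-key <key-id>
    # Type: expire
    # Select: 2y (extend 2 years)
    # Type: save
    # Export the updated key, then update the secret in GitHub
    gpg --armor --export-secret-keys <key-id> > exported-key.txt
    gh secret set GPG_SIGNING_KEY < exported-key.txt
- Verify passphrase:
    # Test passphrase works (GnuPG 2.1+ needs batch mode and loopback pinentry)
    echo "test message" | gpg --batch --pinentry-mode loopback \
      --passphrase "$PASS" --detach-sign > /dev/null
    # If fails, update GPG_KEY_PASSPHRASE secret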
Repository issues:
- Rebuild repository from scratch:
    # Via manual workflow
    Actions → APT Repository → Run workflow
    # Removes corrupted metadata and rebuilds
- Remove duplicate packages:
    # Find duplicate
    apt-cache policy crankshaft-ui
    # Remove older version
    # (Contact APT maintainer)
- Monitor GPG key expiration (check quarterly)
- Test APT updates on a regular basis
- Use apt-get clean after testing
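The quarterly expiration check can be scripted against GnuPG's machine-readable listing, where field 5 of a `pub` record is the key ID and field 7 is the expiration time. This is a minimal sketch, assuming a GnuPG version that prints that field as epoch seconds; the function name `gpg_days_to_expiry` is hypothetical.

```shell
# Hypothetical helper: read `gpg --list-keys --with-colons` output on stdin
# and print "<key-id> <days-left>" for each primary key that has an expiry.
# $1 = current time as epoch seconds (optional; defaults to now).
# Assumes field 7 is given in epoch seconds.
gpg_days_to_expiry() {
    awk -F: -v now="${1:-$(date +%s)}" '
        $1 == "pub" && $7 != "" {
            # Negative values mean the key has already expired.
            printf "%s %d\n", $5, ($7 - now) / 86400
        }'
}
```

Typical use would be `gpg --list-keys --with-colons | gpg_days_to_expiry`. Keys without an expiration date produce no output.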
- Workflow fails at signing step
- Error: "gpg: error while signing: unknown error"
- Error: "gpg: failed to sign"
- Status: "APT Publish: ✗" or "Release: ✗"
- GPG key expired
- GPG key passphrase incorrect
- GPG binary not installed
- Insufficient entropy (on headless system)
- GPG agent timeout
- Check GPG installation:
    gpg --version
    # Should output version and build info
- Check key status:
    gpg --list-secret-keys
    # Should show key with [SCEA] flags
    # Check "sec rsa4096 ... [expired: ...]"
- Test signing locally:
    echo "test" | gpg --detach-sign --local-user <key-id> > /dev/null
    # If fails, key issue confirmed
Check passphrase:
# Verify passphrase works (GnuPG 2.1+ also needs loopback pinentry)
echo "test message" | \
  gpg --batch --no-tty --pinentry-mode loopback \
  --passphrase "$GPG_KEY_PASSPHRASE" \
  --detach-sign > /dev/null
# If fails: passphrase incorrect, update secret
gh secret set GPG_KEY_PASSPHRASE < <(echo "correct-passphrase")
Renew expired key:
# Get the primary key fingerprint (--quick-set-expire expects a fingerprint)
KEY_FPR=$(gpg --list-secret-keys --with-colons | \
  awk -F: '$1 == "fpr" { print $10; exit }')
# Extend expiration
gpg --batch --no-tty --quick-set-expire "$KEY_FPR" 2y
# Export (ASCII-armored, so it survives as secret text) and update secret
gpg --armor --export-secret-keys "$KEY_FPR" | \
  gh secret set GPG_SIGNING_KEY
For headless systems (Linux, CI):
# Generate entropy (if system slow)
# Feed the kernel entropy pool from /dev/urandom
sudo rngd -r /dev/urandom
# Or use a terminal pinentry and restart the agent to clear agent timeouts
echo "pinentry-program /usr/bin/pinentry-curses" >> ~/.gnupg/gpg-agent.conf
gpgconf --kill gpg-agent
- Monitor key expiration dates
- Refresh keys quarterly
- Test signing in release preview phase
- Release workflow fails
- Error: "Release creation failed"
- Error: "Tag not found"
- Status: "Release: ✗"
- Build artifacts missing (Issue #3)
- Tag doesn't exist or not pushed
- Release already exists
- GitHub API rate limit exceeded
- Insufficient permissions
- Check if tag exists:
    git tag --list v1.2.3
    # If it prints nothing, the tag does not exist locally (see "If tag missing")
    git ls-remote --tags origin v1.2.3
    # If that prints nothing: tag not pushed
    git push origin v1.2.3
- Check if release exists:
    gh release view v1.2.3
    # If shows "release not found", proceed with creation
    # If shows details, release already exists
- Check build artifacts:
    # Find build run ID from logs
    gh run list --branch main --status success --limit 1
    # List the artifacts attached to that run
    gh api repos/opencardev/crankshaft/actions/runs/<build-run-id>/artifacts \
      --jq '.artifacts[].name'
If tag missing:
# Create and push tag
git tag v1.2.3
git push origin v1.2.3
# Workflow automatically triggers, creates release
If release already exists (updating):
# Delete old release and remote tag
gh release delete v1.2.3 --yes
git push origin :refs/tags/v1.2.3
# Recreate (-f replaces the local tag if it still exists)
git tag -f v1.2.3
git push origin v1.2.3
If using manual release mode:
# Go to Actions → Release → Run workflow
# Enter build-run-id from successful build
# Leave tag field empty
# Workflow uses existing artifacts without rebuilding
- Tag production releases with semantic versioning (v1.2.3)
- Use git push origin --tags to ensure tags are pushed
- Verify release in GitHub UI before announcing
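A tag can be checked against the v1.2.3 shape before it is pushed, which avoids triggering the release workflow with a malformed name. This is a minimal sketch; the function name `is_release_tag` is hypothetical, and the pattern is deliberately strict (pre-release suffixes such as `-rc1` are rejected), which may or may not match the project's tagging policy.

```shell
# Hypothetical helper: succeed only for tags shaped like v1.2.3.
# Strict on purpose: no pre-release or build suffixes are accepted.
is_release_tag() {
    printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}
```

For example, `is_release_tag "$TAG" && git push origin "$TAG"` pushes only well-formed tags, where `TAG` is a variable holding the candidate tag name.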
- Image written to SD card
- Raspberry Pi 4 boots, shows black screen
- No SSH access, no console output
- Status: "Pi-Gen Images: ✓" (workflow succeeded)
- Wrong APT channel used (nightly vs stable)
- Kernel version incompatible with RPi4
- Device tree blob missing or corrupt
- Root filesystem not extracted properly
- SD card not fully written
- Check which image was used:
    # Look at Pi-Gen workflow run
    Actions → Pi-Gen Images → [Run]
    Check input: apt_channel (stable/nightly)
    Check input: image_types (lite/full)
- Verify SD card write:
    # On Linux/WSL
    lsblk   # List drives
    sudo dd if=image.img of=/dev/sdX bs=4M status=progress
    sync    # Ensure written
- Check for errors during build:
    # In workflow logs
    Look for "ERROR" or "FAILED" messages
    Check "Boot verification" step
Use stable APT channel:
# Retry Pi-Gen build
Actions → Pi-Gen Images → Run workflow
Select apt_channel: stable (not nightly)
# Wait for completion, rewrite SD card
Verify image file integrity:
# Check SHA256 (from workflow artifacts)
sha256sum image.img
# Compare with artifacts/image.img.sha256
# If mismatch: download again and retry
Test with lite image first:
# Lite image more reliable than full
# Download lite image from workflow artifacts
# Write to SD card
# Boot and test connectivity:
ssh -i /path/to/key pi@<pi-ip>
Enable HDMI debugging:
# Add to config.txt on the SD card boot partition
hdmi_force_hotplug=1
hdmi_drive=2
- Always test on Raspberry Pi 4 (same hardware)
- Use stable APT channel for production
- Keep SD card with known good image for testing
- Monitor Pi-Gen build logs for warnings
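The integrity check described above (compare the image's SHA-256 with the published value) can be wrapped so that a mismatch stops the process before the SD card is written. Below is a minimal sketch, assuming `sha256sum` from GNU coreutils; the function name `verify_sha256` is hypothetical.

```shell
# Hypothetical helper: compare a file's SHA-256 with an expected hex digest.
# $1 = image path, $2 = expected digest (upper or lower case).
verify_sha256() {
    local actual expected
    actual=$(sha256sum "$1" 2>/dev/null | cut -d' ' -f1)
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    # An empty digest (unreadable file) never counts as a match.
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

Assuming the `.sha256` file from the workflow artifacts is in `sha256sum` format (digest first), a call could look like `verify_sha256 image.img "$(cut -d' ' -f1 image.img.sha256)" && echo "image OK"`.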
- Workflow fails accessing secret
- Error: "Secret 'X' not found"
- Error: "Null or empty value"
- Status: "Job failed"
- Secret not created in repository
- Secret not inherited from organization
- Workflow using wrong secret name
- Secret deleted or renamed
- Access token expired
- Check if secret exists:
    gh secret list --repo opencardev/crankshaft
    # Compare with secret name in workflow YAML
    # Check spelling and case
- Check secret scope:
    # Repository secrets
    gh secret list --repo opencardev/crankshaft
    # Organization secrets
    gh secret list --org opencardev
- Check workflow variable reference:
    # Correct syntax
    - name: Use secret
      run: echo ${{ secrets.MY_SECRET }}
    # Incorrect (missing 'secrets.' prefix)
      run: echo ${{ MY_SECRET }}
Create missing secret:
# Option 1: GitHub CLI
gh secret set MY_SECRET < <(echo "secret-value")
# Option 2: GitHub UI
Settings → Secrets and variables → Actions → New repository secret
# Then reference in workflow
env:
  MY_SECRET: ${{ secrets.MY_SECRET }}
Check secret value:
# Can't view secret value, but can verify creation
gh secret list
# Should show "MY_SECRET" in list
Fix workflow reference:
# Wrong
- run: echo ${{ MY_SECRET }}
# Correct
- run: echo ${{ secrets.MY_SECRET }}
# Or use env
env:
  SOME_VAR: ${{ secrets.MY_SECRET }}
- run: echo $SOME_VAR
- Document all required secrets in README
- Validate secret names match workflow references
- Periodically audit secrets (remove unused)
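Validating that secret names match workflow references is easier with a list of every name the workflows actually use. Here is a minimal sketch that extracts `secrets.NAME` references from YAML text; the function name `list_secret_refs` is hypothetical.

```shell
# Hypothetical helper: list the distinct secret names referenced as
# `secrets.NAME` in workflow YAML read from stdin, one per line, sorted.
list_secret_refs() {
    grep -oE 'secrets\.[A-Za-z_][A-Za-z0-9_]*' \
        | sed 's/^secrets\.//' \
        | sort -u
}
```

Running `cat .github/workflows/*.yml | list_secret_refs` and comparing the result with `gh secret list` shows names that are referenced but not defined. `GITHUB_TOKEN` may appear in the output even though GitHub provides it automatically.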
- Workflow appears in "Queued" state indefinitely
- Status shows yellow dot (pending)
- Not progressing even after 30+ minutes
- Other workflows on same concurrency group are running
- Previous workflow not cancelled properly
- Concurrency group limits too restrictive
- Zombie workflow stuck in running state
- Race condition in concurrency key
- Check concurrency group:
    # View workflow
    Actions → [Workflow] → [Run]
    Look for "Concurrency group: build-main" (example)
- Check other runs in group:
    # Via GitHub CLI
    gh run list --status in_progress | grep "build-main"
    # Should show only one run at a time
- Check run duration:
    # If previous run >2 hours, likely stuck
    gh run view <run-id> --json status,startedAt,updatedAt
Cancel stuck workflow:
# Via GitHub CLI
gh run cancel <stuck-run-id>
# Queued workflow should start immediately
Monitor concurrency:
# Check if one branch blocks others
Actions → [Workflow] → [Recent runs]
# If many "Queued" with one "In progress"
# Concurrency too restrictive, consider adjusting
Increase parallel jobs:
# In workflow YAML, if appropriate
concurrency:
  group: build-${{ github.ref }}
  cancel-in-progress: true
# And increase max runners
jobs:
  build-amd64:
    runs-on: [self-hosted, amd64]
  build-arm64:
    runs-on: [self-hosted, arm64]
- Monitor workflow queue regularly
- Set reasonable timeouts (45 min for builds)
- Use cancel-in-progress: true to clean up old runs
- Build fails with "No space left on device"
- Docker build fails: "Write failed"
- Artifact upload fails
- Status: "Build: ✗"
- Docker images accumulating (no cleanup)
- Build artifacts from previous runs
- Temporary build files not cleaned
- Insufficient disk on runner
- Check disk usage (on build runner):
    df -h
    # Check if root (/) or /var is full
    du -sh /*
- Check Docker usage:
    docker system df
    # Should show images, containers, volumes
- Check temporary files:
    du -sh /tmp /var/tmp ~/.cache
    # Large directories indicate cleanup needed
Clean Docker images:
# Remove dangling images
docker image prune -f
# Remove unused images older than 72 hours (-a includes non-dangling images)
docker image prune -a -f --filter "until=72h"
Clean build artifacts:
# Remove old GitHub Actions work directories
cd ~/work
# Find and remove top-level directories older than 7 days
find . -mindepth 1 -maxdepth 1 -type d -ctime +7 -exec rm -rf {} +
Extend disk space:
# For GitHub-hosted runners: auto-cleanup handled
# For self-hosted runners:
# 1. Add disk space to system
# 2. Or add cache cleanup job to workflow
# In workflow YAML
- name: Clean up Docker
  if: always()
  run: |
    docker system prune -f --all
    docker volume prune -f
Add cleanup step to workflow:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # ... build steps ...
      - name: Cleanup
        if: always()
        run: |
          du -sh ~/.cache
          rm -rf ~/.cache/pip
          docker system prune -f
- Add cleanup steps to all long-running workflows
- Monitor disk usage weekly on self-hosted runners
- Set GitHub Actions artifact retention to 7-14 days
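Weekly disk monitoring on self-hosted runners can be reduced to a threshold check over `df` output. This is a minimal sketch, assuming POSIX-format output from `df -P` and mount points without spaces; the function name `full_filesystems` and the 90% default are hypothetical choices.

```shell
# Hypothetical helper: read `df -P` output on stdin and print every mount
# point whose usage is at or above the threshold.
# $1 = threshold in percent (optional; defaults to 90).
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            use = $5            # Capacity column, e.g. "95%"
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6, use "%"
        }'
}
```

A scheduled job could run `df -P | full_filesystems 85` and alert when the output is non-empty.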
- Workflow name and run ID
    gh run list --repo opencardev/crankshaft | head -5
- Full workflow log
    gh run view <run-id> --log > run.log
- System information
    uname -a
    docker --version
    cmake --version
- Recent commits
    git log --oneline -5
- Workflow guide: docs/ci-cd/workflow-guide.md
- Architecture decisions: docs/ci-cd/architecture-decisions.md
- Developer handbook: docs/ci-cd/developer-handbook.md
- GitHub Actions docs: https://docs.github.com/en/actions
- Community issues: https://github.com/opencardev/crankshaft/issues
- GitHub Issues: Report bugs, request features
- GitHub Discussions: Ask questions, share knowledge
- Slack/Discord: Real-time chat (if enabled)
- Email: Submit detailed bug reports
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2025-01-01 | Initial version with top 10 issues |
| 1.1 | 2025-01-XX | Added GPU/ARM runner specifics (pending) |