Skip to content

NixOS Custom Command System - Complete Overview

Introduction

The NixOS custom command system provides specialized, automated workflows for managing complex NixOS infrastructure. Each command embeds best practices, safety checks, GitHub integration, and comprehensive documentation into repeatable procedures.

Architecture

Command Structure

.claude/commands/
├── flake-update.md         # Flake input management
├── update-claude-code.md   # Claude Code package updates
├── system-health-check.md  # Infrastructure monitoring
├── update-package.md       # Generic package updates
├── deploy-all.md          # Multi-host deployment
└── config-audit.md        # Configuration quality audit

Command Format

Each command is a Markdown file containing:

  1. Task Overview: High-level description
  2. Prerequisites: Required checks before execution
  3. Step-by-Step Procedures: Detailed workflow
  4. Success Criteria: Validation checklist
  5. Rollback Procedures: Emergency recovery steps
  6. Troubleshooting: Common issues and solutions
  7. Documentation References: Links to relevant docs

Execution Flow

graph TD
    A[User invokes /command] --> B[Claude Code loads command]
    B --> C[Execute prerequisites check]
    C --> D{Prerequisites pass?}
    D -->|No| E[Report issues, halt]
    D -->|Yes| F[Execute main workflow]
    F --> G[Run validation checks]
    G --> H{Validation pass?}
    H -->|No| I[Report errors, rollback]
    H -->|Yes| J[Complete successfully]
    J --> K[Generate documentation]
    K --> L[Update GitHub if needed]

Command Catalog

1. /flake-update - Flake Input Management

Purpose: Safely update NixOS flake inputs with testing and deployment

When to Use:

  • Weekly maintenance schedule
  • After upstream security updates
  • When new features needed from nixpkgs
  • Before major infrastructure changes

Key Features:

  • Pre-update system checks
  • Parallel host testing
  • Smart deployment (only changed hosts)
  • Post-deployment verification
  • Automatic rollback on failure
  • Detailed commit documentation

Typical Duration: 10-15 minutes

Risk Level: Medium (automated testing reduces risk)

2. /system-health-check - Infrastructure Monitoring

Purpose: Comprehensive health assessment across all infrastructure

When to Use:

  • Weekly scheduled checks
  • After major deployments
  • When investigating issues
  • Before starting new projects

Key Features:

  • Multi-host connectivity testing
  • Service status verification
  • Resource utilization analysis
  • Monitoring stack validation
  • Media server health (Plex/NZBGet)
  • Security audit
  • Comprehensive report generation

Typical Duration: 5-10 minutes

Risk Level: None (read-only operations)

3. /update-claude-code - Claude Code Updates

Purpose: Update Claude Code package with proper testing

When to Use:

  • New release available
  • Bug fixes needed
  • Feature updates desired
  • Monthly maintenance

Key Features:

  • Automatic version research
  • Hash calculation
  • Multi-host testing
  • GitHub issue/PR creation
  • Deployment verification

Typical Duration: 8-12 minutes

Risk Level: Low (single package update)

4. /update-package - Generic Package Updates

Purpose: Update any package in the configuration

When to Use:

  • Security updates
  • Bug fixes
  • Feature requirements
  • Dependency updates

Key Features:

  • Flexible package selection
  • Multiple update strategies
  • Comprehensive testing
  • GitHub workflow integration
  • Rollback procedures

Typical Duration: 10-20 minutes (varies by package)

Risk Level: Medium (depends on package criticality)

5. /deploy-all - Multi-Host Deployment

Purpose: Deploy configuration changes to all hosts

When to Use:

  • After configuration changes
  • Post-testing deployment
  • System-wide updates
  • Coordinated rollouts

Key Features:

  • Pre-deployment validation
  • Multiple deployment strategies
  • Post-deployment verification
  • Monitoring integration
  • Emergency rollback
  • Deployment documentation

Typical Duration: 5-15 minutes (depends on strategy)

Risk Level: Medium-High (affects all hosts)

6. /config-audit - Configuration Quality Audit

Purpose: Audit configuration for anti-patterns and quality

When to Use:

  • Monthly quality checks
  • Before major releases
  • After large changes
  • Code review processes

Key Features:

  • Anti-pattern detection
  • Security audit
  • Module system review
  • Performance analysis
  • Code quality metrics
  • Comprehensive reporting
  • Issue creation for findings

Typical Duration: 15-25 minutes

Risk Level: None (analysis only)

Command Comparison Matrix

Command Frequency Duration Risk Auto-Deploy GitHub Integration
/flake-update Weekly 10-15m Medium Yes Yes
/system-health-check Weekly 5-10m None No Optional
/update-claude-code Monthly 8-12m Low Yes Yes
/update-package As needed 10-20m Medium Yes Yes
/deploy-all As needed 5-15m Med-High Yes Optional
/config-audit Monthly 15-25m None No Yes

Integration with Infrastructure

GitHub Workflow Integration

Commands integrate with the GitHub-based workflow:

# Command creates issue
/update-claude-code
# → Creates GitHub issue with /new_task
# → Creates branch following conventions
# → Creates PR with testing evidence
# → Tracks progress in GitHub

# Command checks status
/system-health-check
# → Reviews open issues with /check_tasks
# → Reports on issue status
# → Recommends prioritization

Just Command Integration

Commands leverage existing just commands:

# Commands use just for:
just check-syntax    # Syntax validation
just validate-quick  # Fast configuration validation
just quick-test      # Parallel host testing
just quick-deploy    # Smart deployment
just test-host HOST  # Individual host testing

Monitoring Integration

Commands interact with monitoring stack:

# Health checks verify:
grafana-status       # Dashboard availability
prometheus-status    # Metrics collection
node-exporter-status # Exporter health

# Commands access monitoring:
curl http://p620:9090/api/v1/targets  # Prometheus
curl http://p620:9093/api/v2/alerts   # Alertmanager

Documentation Integration

Commands reference infrastructure documentation:

  • @docs/PATTERNS.md - NixOS best practices
  • @docs/NIXOS-ANTI-PATTERNS.md - What to avoid
  • @docs/GITHUB-WORKFLOW.md - Development workflow
  • @.agent-os/product/roadmap.md - Project priorities

Command Development Workflow

Creating New Commands

  1. Identify Need
# What repetitive task needs automation?
# What workflow needs documentation?
# What process has high error risk?
  1. Design Command
# Command structure:

- Task overview
- Prerequisites
- Step-by-step workflow
- Success criteria
- Rollback procedures
- Troubleshooting
- Documentation references
  1. Implement Command
# Create command file
vim .claude/commands/new-command.md

# Follow existing patterns
# Include comprehensive error handling
# Add rollback procedures
  1. Test Command
# Test execution
claude
/new-command

# Verify all steps
# Test error conditions
# Confirm rollback works
  1. Document Command
# Add to docs/Nixos/
vim docs/Nixos/New-Command-Guide.md

# Update README
vim docs/Nixos/README.md
  1. Create GitHub Issue
/new_task
# Track command creation
# Document design decisions
# Plan testing

Updating Existing Commands

# 1. Identify improvement
# - Error handling gaps
# - Missing features
# - Outdated procedures

# 2. Edit command
vim .claude/commands/COMMAND.md

# 3. Test changes
claude
/COMMAND

# 4. Update documentation
vim docs/Nixos/COMMAND-Guide.md

# 5. Commit changes
git add .claude/commands/COMMAND.md docs/Nixos/
git commit -m "feat(commands): improve /COMMAND workflow"

Best Practices

Command Usage

Do:

  • Read command documentation first
  • Review command output carefully
  • Follow all validation steps
  • Monitor after deployment
  • Create GitHub issues for tracking
  • Keep commands up-to-date

Don't:

  • Skip prerequisite checks
  • Ignore warning messages
  • Deploy without testing
  • Forget rollback procedures
  • Skip documentation updates

Command Development

Do:

  • Follow existing patterns
  • Include comprehensive error handling
  • Add rollback procedures
  • Document all steps clearly
  • Test thoroughly before use
  • Keep commands maintainable

Don't:

  • Create overly complex commands
  • Skip error handling
  • Forget rollback procedures
  • Write unclear documentation
  • Neglect testing

Safety Features

Built-in Safeguards

  1. Prerequisites Checks
# Commands verify before executing:
- Git status clean
- No critical open issues
- Monitoring operational
- Hosts reachable
  1. Validation Steps
# Commands validate at each stage:
just check-syntax     # Syntax errors
just validate-quick   # Configuration errors
just quick-test       # Build errors
  1. Rollback Procedures
# Every command includes:
- Generation rollback: nixos-rebuild switch --rollback
- Git revert: git revert COMMIT
- Manual recovery: step-by-step instructions
  1. Monitoring Integration
# Commands verify deployment health:
- Service status checks
- Monitoring metrics
- Alert verification
- 24-hour monitoring plan

Risk Mitigation

Risk Mitigation
Syntax errors Pre-deployment syntax validation
Build failures Comprehensive testing before deploy
Service failures Post-deployment verification
Configuration drift Declarative configuration + git
Security issues Secrets management + hardening checks
Performance issues Resource monitoring + alerts
Human error Automated workflows + validation

Troubleshooting

Command Not Executing

# 1. Verify command exists
ls -la .claude/commands/

# 2. Check command file syntax
cat .claude/commands/COMMAND.md

# 3. Restart Claude Code
exit
claude

Command Fails During Execution

# 1. Review error message
# Error messages indicate specific failure

# 2. Check prerequisites
# Ensure all prerequisites met

# 3. Use rollback procedure
# Each command includes rollback steps

# 4. Create issue for investigation
/new_task

Unexpected Results

# 1. Review command output
# Commands provide detailed logging

# 2. Check system state
/system-health-check

# 3. Consult documentation
cat docs/Nixos/COMMAND-Guide.md

# 4. Check recent changes
git log --oneline -10

Performance Optimization

Command Execution Speed

Optimization Impact
Parallel testing 75% faster
Smart deployment Skip unchanged hosts
Binary caching Faster builds
Local testing No network overhead

Workflow Efficiency

# Traditional workflow: 30+ minutes
# 1. Manual research (5-10 min)
# 2. File updates (5 min)
# 3. Hash calculation (2-5 min)
# 4. Testing (5-10 min)
# 5. GitHub workflow (5-10 min)
# 6. Deployment (5 min)

# Command workflow: 5-15 minutes
/update-package
# Everything automated

Result: 10x productivity improvement

Metrics and Monitoring

Command Success Metrics

Track command effectiveness:

# Success rate
echo "Commands run: N"
echo "Successful: N (X%)"
echo "Failed: N (X%)"

# Time savings
echo "Average time without commands: 30 min"
echo "Average time with commands: 10 min"
echo "Time saved: 20 min per operation"

# Quality improvements
echo "Issues detected before deployment: N"
echo "Rollbacks required: N"
echo "Anti-patterns eliminated: N"

Usage Statistics

# Track command usage
/flake-update: Weekly (52/year)
/system-health-check: Weekly (52/year)
/update-package: Monthly (12/year)
/config-audit: Monthly (12/year)

# Total operations: ~150/year
# Time saved: 50+ hours/year

Future Enhancements

Planned Commands

  1. /backup-verify
  2. Verify backup status
  3. Test restore procedures
  4. Generate backup reports

  5. /security-audit

  6. Security-focused audit
  7. Vulnerability scanning
  8. Compliance checking

  9. /performance-analysis

  10. Performance profiling
  11. Bottleneck identification
  12. Optimization recommendations

  13. /weekly-maintenance

  14. Automated weekly tasks
  15. Combines health check + updates
  16. Generates weekly report

  17. /emergency-rollback

  18. Fast rollback all hosts
  19. Emergency recovery
  20. Incident documentation

Enhancement Ideas

  • Command chaining and composition
  • Interactive prompts for complex decisions
  • Progress visualization
  • Notification integration (email, Slack)
  • Historical trend analysis
  • Predictive maintenance

Documentation and Learning

Command Documentation Structure

Each command has:

  1. Command File: .claude/commands/COMMAND.md
  2. Executable workflow
  3. Step-by-step instructions
  4. Code examples

  5. Guide Document: docs/Nixos/COMMAND-Guide.md

  6. Conceptual explanation
  7. Use cases and examples
  8. Best practices
  9. Troubleshooting

  10. Integration References

  11. Links to pattern docs
  12. Links to anti-pattern docs
  13. Links to workflows

Learning Path

  1. Start Here: Read README.md
  2. Understand: Read this overview
  3. Try Simple: Run /system-health-check
  4. Try Complex: Run /flake-update
  5. Master: Create custom commands

Summary

The NixOS custom command system provides:

Automation: Complex workflows become one-line commands Safety: Built-in validation and rollback procedures Documentation: Living documentation of procedures Quality: Automatic adherence to best practices Efficiency: 10x productivity improvement Integration: Seamless GitHub and monitoring integration Learning: Clear workflows for team members

Result: Professional, maintainable, and safe NixOS infrastructure management.


Next Steps:

  1. Review individual command guides
  2. Try commands in your workflow
  3. Customize commands for your needs
  4. Create new commands as needed
  5. Share improvements with the team