NixOS Custom Command System - Complete Overview¶
Introduction¶
The NixOS custom command system provides specialized, automated workflows for managing complex NixOS infrastructure. Each command embeds best practices, safety checks, GitHub integration, and comprehensive documentation into repeatable procedures.
Architecture¶
Command Structure¶
.claude/commands/
├── flake-update.md # Flake input management
├── update-claude-code.md # Claude Code package updates
├── system-health-check.md # Infrastructure monitoring
├── update-package.md # Generic package updates
├── deploy-all.md # Multi-host deployment
└── config-audit.md # Configuration quality audit
Command Format¶
Each command is a Markdown file containing:
- Task Overview: High-level description
- Prerequisites: Required checks before execution
- Step-by-Step Procedures: Detailed workflow
- Success Criteria: Validation checklist
- Rollback Procedures: Emergency recovery steps
- Troubleshooting: Common issues and solutions
- Documentation References: Links to relevant docs
Execution Flow¶
graph TD
A[User invokes /command] --> B[Claude Code loads command]
B --> C[Execute prerequisites check]
C --> D{Prerequisites pass?}
D -->|No| E[Report issues, halt]
D -->|Yes| F[Execute main workflow]
F --> G[Run validation checks]
G --> H{Validation pass?}
H -->|No| I[Report errors, rollback]
H -->|Yes| J[Complete successfully]
J --> K[Generate documentation]
K --> L[Update GitHub if needed]
Command Catalog¶
1. /flake-update - Flake Input Management¶
Purpose: Safely update NixOS flake inputs with testing and deployment
When to Use:
- Weekly maintenance schedule
- After upstream security updates
- When new features needed from nixpkgs
- Before major infrastructure changes
Key Features:
- Pre-update system checks
- Parallel host testing
- Smart deployment (only changed hosts)
- Post-deployment verification
- Automatic rollback on failure
- Detailed commit documentation
Typical Duration: 10-15 minutes
Risk Level: Medium (automated testing reduces risk)
2. /system-health-check - Infrastructure Monitoring¶
Purpose: Comprehensive health assessment across all infrastructure
When to Use:
- Weekly scheduled checks
- After major deployments
- When investigating issues
- Before starting new projects
Key Features:
- Multi-host connectivity testing
- Service status verification
- Resource utilization analysis
- Monitoring stack validation
- Media server health (Plex/NZBGet)
- Security audit
- Comprehensive report generation
Typical Duration: 5-10 minutes
Risk Level: None (read-only operations)
3. /update-claude-code - Claude Code Updates¶
Purpose: Update Claude Code package with proper testing
When to Use:
- New release available
- Bug fixes needed
- Feature updates desired
- Monthly maintenance
Key Features:
- Automatic version research
- Hash calculation
- Multi-host testing
- GitHub issue/PR creation
- Deployment verification
Typical Duration: 8-12 minutes
Risk Level: Low (single package update)
4. /update-package - Generic Package Updates¶
Purpose: Update any package in the configuration
When to Use:
- Security updates
- Bug fixes
- Feature requirements
- Dependency updates
Key Features:
- Flexible package selection
- Multiple update strategies
- Comprehensive testing
- GitHub workflow integration
- Rollback procedures
Typical Duration: 10-20 minutes (varies by package)
Risk Level: Medium (depends on package criticality)
5. /deploy-all - Multi-Host Deployment¶
Purpose: Deploy configuration changes to all hosts
When to Use:
- After configuration changes
- Post-testing deployment
- System-wide updates
- Coordinated rollouts
Key Features:
- Pre-deployment validation
- Multiple deployment strategies
- Post-deployment verification
- Monitoring integration
- Emergency rollback
- Deployment documentation
Typical Duration: 5-15 minutes (depends on strategy)
Risk Level: Medium-High (affects all hosts)
6. /config-audit - Configuration Quality Audit¶
Purpose: Audit configuration for anti-patterns and quality
When to Use:
- Monthly quality checks
- Before major releases
- After large changes
- Code review processes
Key Features:
- Anti-pattern detection
- Security audit
- Module system review
- Performance analysis
- Code quality metrics
- Comprehensive reporting
- Issue creation for findings
Typical Duration: 15-25 minutes
Risk Level: None (analysis only)
Command Comparison Matrix¶
| Command | Frequency | Duration | Risk | Auto-Deploy | GitHub Integration |
|---|---|---|---|---|---|
/flake-update |
Weekly | 10-15m | Medium | Yes | Yes |
/system-health-check |
Weekly | 5-10m | None | No | Optional |
/update-claude-code |
Monthly | 8-12m | Low | Yes | Yes |
/update-package |
As needed | 10-20m | Medium | Yes | Yes |
/deploy-all |
As needed | 5-15m | Med-High | Yes | Optional |
/config-audit |
Monthly | 15-25m | None | No | Yes |
Integration with Infrastructure¶
GitHub Workflow Integration¶
Commands integrate with the GitHub-based workflow:
# Command creates issue
/update-claude-code
# → Creates GitHub issue with /new_task
# → Creates branch following conventions
# → Creates PR with testing evidence
# → Tracks progress in GitHub
# Command checks status
/system-health-check
# → Reviews open issues with /check_tasks
# → Reports on issue status
# → Recommends prioritization
Just Command Integration¶
Commands leverage existing just commands:
# Commands use just for:
just check-syntax # Syntax validation
just validate-quick # Fast configuration validation
just quick-test # Parallel host testing
just quick-deploy # Smart deployment
just test-host HOST # Individual host testing
Monitoring Integration¶
Commands interact with monitoring stack:
# Health checks verify:
grafana-status # Dashboard availability
prometheus-status # Metrics collection
node-exporter-status # Exporter health
# Commands access monitoring:
curl http://p620:9090/api/v1/targets # Prometheus
curl http://p620:9093/api/v2/alerts # Alertmanager
Documentation Integration¶
Commands reference infrastructure documentation:
@docs/PATTERNS.md- NixOS best practices@docs/NIXOS-ANTI-PATTERNS.md- What to avoid@docs/GITHUB-WORKFLOW.md- Development workflow@.agent-os/product/roadmap.md- Project priorities
Command Development Workflow¶
Creating New Commands¶
- Identify Need
# What repetitive task needs automation?
# What workflow needs documentation?
# What process has high error risk?
- Design Command
# Command structure:
- Task overview
- Prerequisites
- Step-by-step workflow
- Success criteria
- Rollback procedures
- Troubleshooting
- Documentation references
- Implement Command
# Create command file
vim .claude/commands/new-command.md
# Follow existing patterns
# Include comprehensive error handling
# Add rollback procedures
- Test Command
# Test execution
claude
/new-command
# Verify all steps
# Test error conditions
# Confirm rollback works
- Document Command
- Create GitHub Issue
Updating Existing Commands¶
# 1. Identify improvement
# - Error handling gaps
# - Missing features
# - Outdated procedures
# 2. Edit command
vim .claude/commands/COMMAND.md
# 3. Test changes
claude
/COMMAND
# 4. Update documentation
vim docs/Nixos/COMMAND-Guide.md
# 5. Commit changes
git add .claude/commands/COMMAND.md docs/Nixos/
git commit -m "feat(commands): improve /COMMAND workflow"
Best Practices¶
Command Usage¶
Do:
- Read command documentation first
- Review command output carefully
- Follow all validation steps
- Monitor after deployment
- Create GitHub issues for tracking
- Keep commands up-to-date
Don't:
- Skip prerequisite checks
- Ignore warning messages
- Deploy without testing
- Forget rollback procedures
- Skip documentation updates
Command Development¶
Do:
- Follow existing patterns
- Include comprehensive error handling
- Add rollback procedures
- Document all steps clearly
- Test thoroughly before use
- Keep commands maintainable
Don't:
- Create overly complex commands
- Skip error handling
- Forget rollback procedures
- Write unclear documentation
- Neglect testing
Safety Features¶
Built-in Safeguards¶
- Prerequisites Checks
# Commands verify before executing:
- Git status clean
- No critical open issues
- Monitoring operational
- Hosts reachable
- Validation Steps
# Commands validate at each stage:
just check-syntax # Syntax errors
just validate-quick # Configuration errors
just quick-test # Build errors
- Rollback Procedures
# Every command includes:
- Generation rollback: nixos-rebuild switch --rollback
- Git revert: git revert COMMIT
- Manual recovery: step-by-step instructions
- Monitoring Integration
# Commands verify deployment health:
- Service status checks
- Monitoring metrics
- Alert verification
- 24-hour monitoring plan
Risk Mitigation¶
| Risk | Mitigation |
|---|---|
| Syntax errors | Pre-deployment syntax validation |
| Build failures | Comprehensive testing before deploy |
| Service failures | Post-deployment verification |
| Configuration drift | Declarative configuration + git |
| Security issues | Secrets management + hardening checks |
| Performance issues | Resource monitoring + alerts |
| Human error | Automated workflows + validation |
Troubleshooting¶
Command Not Executing¶
# 1. Verify command exists
ls -la .claude/commands/
# 2. Check command file syntax
cat .claude/commands/COMMAND.md
# 3. Restart Claude Code
exit
claude
Command Fails During Execution¶
# 1. Review error message
# Error messages indicate specific failure
# 2. Check prerequisites
# Ensure all prerequisites met
# 3. Use rollback procedure
# Each command includes rollback steps
# 4. Create issue for investigation
/new_task
Unexpected Results¶
# 1. Review command output
# Commands provide detailed logging
# 2. Check system state
/system-health-check
# 3. Consult documentation
cat docs/Nixos/COMMAND-Guide.md
# 4. Check recent changes
git log --oneline -10
Performance Optimization¶
Command Execution Speed¶
| Optimization | Impact |
|---|---|
| Parallel testing | 75% faster |
| Smart deployment | Skip unchanged hosts |
| Binary caching | Faster builds |
| Local testing | No network overhead |
Workflow Efficiency¶
# Traditional workflow: 30+ minutes
# 1. Manual research (5-10 min)
# 2. File updates (5 min)
# 3. Hash calculation (2-5 min)
# 4. Testing (5-10 min)
# 5. GitHub workflow (5-10 min)
# 6. Deployment (5 min)
# Command workflow: 5-15 minutes
/update-package
# Everything automated
Result: 10x productivity improvement
Metrics and Monitoring¶
Command Success Metrics¶
Track command effectiveness:
# Success rate
echo "Commands run: N"
echo "Successful: N (X%)"
echo "Failed: N (X%)"
# Time savings
echo "Average time without commands: 30 min"
echo "Average time with commands: 10 min"
echo "Time saved: 20 min per operation"
# Quality improvements
echo "Issues detected before deployment: N"
echo "Rollbacks required: N"
echo "Anti-patterns eliminated: N"
Usage Statistics¶
# Track command usage
/flake-update: Weekly (52/year)
/system-health-check: Weekly (52/year)
/update-package: Monthly (12/year)
/config-audit: Monthly (12/year)
# Total operations: ~150/year
# Time saved: 50+ hours/year
Future Enhancements¶
Planned Commands¶
/backup-verify- Verify backup status
- Test restore procedures
-
Generate backup reports
-
/security-audit - Security-focused audit
- Vulnerability scanning
-
Compliance checking
-
/performance-analysis - Performance profiling
- Bottleneck identification
-
Optimization recommendations
-
/weekly-maintenance - Automated weekly tasks
- Combines health check + updates
-
Generates weekly report
-
/emergency-rollback - Fast rollback all hosts
- Emergency recovery
- Incident documentation
Enhancement Ideas¶
- Command chaining and composition
- Interactive prompts for complex decisions
- Progress visualization
- Notification integration (email, Slack)
- Historical trend analysis
- Predictive maintenance
Documentation and Learning¶
Command Documentation Structure¶
Each command has:
- Command File:
.claude/commands/COMMAND.md - Executable workflow
- Step-by-step instructions
-
Code examples
-
Guide Document:
docs/Nixos/COMMAND-Guide.md - Conceptual explanation
- Use cases and examples
- Best practices
-
Troubleshooting
-
Integration References
- Links to pattern docs
- Links to anti-pattern docs
- Links to workflows
Learning Path¶
- Start Here: Read README.md
- Understand: Read this overview
- Try Simple: Run
/system-health-check - Try Complex: Run
/flake-update - Master: Create custom commands
Summary¶
The NixOS custom command system provides:
Automation: Complex workflows become one-line commands Safety: Built-in validation and rollback procedures Documentation: Living documentation of procedures Quality: Automatic adherence to best practices Efficiency: 10x productivity improvement Integration: Seamless GitHub and monitoring integration Learning: Clear workflows for team members
Result: Professional, maintainable, and safe NixOS infrastructure management.
Next Steps:
- Review individual command guides
- Try commands in your workflow
- Customize commands for your needs
- Create new commands as needed
- Share improvements with the team