Operator Runbook
This is a minimal operational checklist for keeping Blueprint services healthy and safe in production.
Daily Checks
- Verify the Blueprint Manager process is running and connected to both its RPC and WebSocket (WS) endpoints.
- Confirm service heartbeats are progressing (no sustained gaps).
- Review job error rates and retry spikes.
- Check disk usage for the cache and data directories.
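The disk usage check is easy to automate. The following is a minimal sketch using only the Python standard library. The directory paths and the 85% threshold are illustrative assumptions, not values defined by Blueprint.

```python
import shutil
from typing import Iterable, List, Tuple


def disk_usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual filesystems report a zero total; treat them as empty.
        return 0.0
    return usage.used / usage.total


def over_threshold(paths: Iterable[str], threshold: float = 0.85) -> List[Tuple[str, float]]:
    """Return (path, fraction) pairs whose filesystem usage exceeds `threshold`."""
    flagged = []
    for path in paths:
        fraction = disk_usage_fraction(path)
        if fraction > threshold:
            flagged.append((path, fraction))
    return flagged


if __name__ == "__main__":
    # Hypothetical locations: replace "." with your actual cache and data directories.
    for path, fraction in over_threshold(["."]):
        print(f"WARNING: {path} is {fraction:.0%} full")
```

Run from cron or a monitoring agent, a script like this turns the daily manual check into an alert.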
Key Signals to Watch
- Heartbeat drift: late or missing heartbeats can trigger QoS degradation.
- Job queue backlog: growing queues indicate capacity pressure.
- RPC latency: slow RPC responses can cause the manager to miss service events.
- Crash loops: repeated restarts usually imply config or artifact issues.
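Two of these signals, heartbeat drift and queue backlog, reduce to simple checks over a series of samples. The sketch below assumes you already collect heartbeat timestamps (in seconds) and periodic queue-depth samples from your own monitoring. The function names, the tolerance factor, and the window size are hypothetical and not part of any Blueprint API.

```python
from typing import List, Sequence, Tuple


def find_heartbeat_gaps(
    timestamps: Sequence[float],
    expected_interval: float,
    tolerance: float = 1.5,
) -> List[Tuple[float, float]]:
    """Return (previous, current) timestamp pairs that are further apart
    than expected_interval * tolerance."""
    limit = expected_interval * tolerance
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > limit:
            gaps.append((prev, curr))
    return gaps


def backlog_is_growing(depths: Sequence[int], window: int = 3) -> bool:
    """Return True if the last `window` queue-depth samples are strictly increasing."""
    recent = list(depths[-window:])
    if len(recent) < window:
        return False  # not enough samples to judge a trend
    return all(a < b for a, b in zip(recent, recent[1:]))


if __name__ == "__main__":
    beats = [0, 10, 20, 45, 55]  # one late heartbeat between t=20 and t=45
    print(find_heartbeat_gaps(beats, expected_interval=10))
    print(backlog_is_growing([5, 4, 6, 9, 12]))
```

A single late heartbeat is usually noise. Alerting on several consecutive gaps better matches the "sustained gaps" criterion from the daily checks.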
Incident Response
- Pause new work by stopping the manager.
- Capture logs and recent job failures for root-cause analysis.
- Restore service with a known-good config and pinned artifact versions.
- Run a small validation job before resuming full traffic.
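The last step, validating before resuming traffic, can be enforced in code rather than left to memory. This is a minimal sketch of such a gate. Both callables are hypothetical placeholders for whatever your deployment uses to submit a validation job and to restart the manager.

```python
from typing import Callable


def resume_after_validation(
    run_validation_job: Callable[[], bool],
    resume_traffic: Callable[[], None],
    attempts: int = 1,
) -> bool:
    """Call `resume_traffic` only if `run_validation_job` returns True
    within `attempts` tries. Returns whether traffic was resumed."""
    for _ in range(attempts):
        if run_validation_job():
            resume_traffic()
            return True
    return False


if __name__ == "__main__":
    # Stand-ins for a real validation job and a real restart command.
    resumed = resume_after_validation(
        run_validation_job=lambda: True,
        resume_traffic=lambda: print("resuming full traffic"),
    )
    print("resumed:", resumed)
```

If the gate returns False, the service stays paused and the operator goes back to the log capture step instead of retrying blindly.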
Capacity Planning
- Reserve headroom for spikes in service requests and simulations.
- Size storage for artifacts + per-service data.
- Isolate noisy workloads onto separate hosts when possible.
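Headroom can be made concrete with a little arithmetic. If each host sustains a known job rate and you want peak load to consume at most a fraction (1 - headroom) of capacity, the host count follows directly. The sketch below assumes you have measured per-host throughput yourself. The function name and the default 30% headroom are illustrative only.

```python
import math


def hosts_needed(
    peak_jobs_per_sec: float,
    per_host_jobs_per_sec: float,
    headroom: float = 0.3,
) -> int:
    """Return the number of hosts required so that peak load uses at most
    (1 - headroom) of total capacity. Always returns at least 1."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    if per_host_jobs_per_sec <= 0:
        raise ValueError("per_host_jobs_per_sec must be positive")
    usable_per_host = per_host_jobs_per_sec * (1.0 - headroom)
    return max(1, math.ceil(peak_jobs_per_sec / usable_per_host))


if __name__ == "__main__":
    # Example: a peak of 100 jobs/s, 20 jobs/s per host, half of capacity held in reserve.
    print(hosts_needed(100, 20, headroom=0.5))
```

The same calculation works for storage if you substitute bytes for jobs per second.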
Security Hygiene
- Keep keystores isolated and use least-privilege access.
- Rotate operator keys on a fixed schedule.
- Use separate RPC credentials per environment.
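Scheduled key rotation is easier to keep up when something flags overdue keys. The following sketch assumes you record the last rotation time for each key somewhere. The 90-day maximum age is an example policy, not a Blueprint requirement.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rotation_due(
    last_rotated: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if at least `max_age` has elapsed since `last_rotated`.
    Both datetimes are expected to be timezone-aware."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_rotated >= max_age


if __name__ == "__main__":
    # Hypothetical record of when an operator key was last rotated.
    last = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(rotation_due(last, max_age=timedelta(days=90)))
```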