Operations Runbook
This runbook provides step-by-step procedures for diagnosing and resolving common operational issues.
Quick Diagnostics
Before diving into specific issues, gather basic information:

```bash
# Check if mik is running
pgrep -a mik

# Check listening ports
ss -tlnp | grep mik

# View recent logs
journalctl -u mik --since "10 minutes ago" -f

# Check health endpoint
curl -s http://localhost:3000/health | jq

# Check metrics
curl -s http://localhost:3000/metrics | head -50
```

Common Issues
High Latency

Symptoms: Slow responses, P99 latency spikes, user complaints.

Diagnosis Steps:

1. Check circuit breaker state in metrics:

   ```bash
   curl -s http://localhost:3000/metrics | grep circuit_breaker
   ```

   If mik_circuit_breaker_state shows 1, the circuit is open.

2. Check AOT cache hit ratio:

   ```bash
   curl -s http://localhost:3000/metrics | grep cache
   ```

   A low hit ratio indicates frequent cold starts.

3. Check for slow handlers:

   ```bash
   journalctl -u mik | jq 'select(.duration_ms > 1000)' | head -20
   ```

4. Check system resources:

   ```bash
   top -p $(pgrep mik)
   iostat -x 1 5
   ```
Resolution:
| Cause | Solution |
|---|---|
| Cold starts | Increase cache_size in mik.toml |
| Circuit breaker open | Check downstream services, wait for recovery |
| High CPU | Reduce max_concurrent_requests |
| Disk I/O | Move AOT cache to faster storage |
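
After applying a fix, the same signals can be re-checked in one pass to confirm the improvement. This is a sketch reusing the metrics endpoint and log fields from the diagnosis steps above:

```bash
# Re-check the usual latency suspects in one pass after applying a fix.
# Assumes the /metrics endpoint and structured log fields used above.
curl -s http://localhost:3000/metrics | grep -E "circuit_breaker|cache"
journalctl -u mik --since "5 minutes ago" \
  | jq 'select(.duration_ms > 1000) | {module, path, duration_ms}' \
  | tail -20
```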
Out of Memory
Symptoms: Process killed, OOM messages in dmesg, "Killed" in logs.

Diagnosis Steps:

1. Check memory metrics:

   ```bash
   curl -s http://localhost:3000/metrics | grep memory
   ```

2. Check system memory:

   ```bash
   free -h
   dmesg | grep -i "out of memory"
   ```

3. Review module cache settings:

   ```bash
   grep -E "cache_size|max_cache_mb" /etc/mik/mik.toml
   ```
Resolution:
Reduce memory usage in mik.toml:
```toml
[server]
max_cache_mb = 128             # Reduce cache memory limit
cache_size = 50                # Fewer cached modules
max_concurrent_requests = 500  # Fewer simultaneous requests
```

For persistent issues:
- Add swap space as a safety net
- Increase container/VM memory
- Split into multiple smaller instances
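
To catch memory growth before the OOM killer does, a simple sampler can run alongside the service. This is a sketch using standard ps and pgrep; the output path and one-minute interval are placeholders:

```bash
#!/bin/bash
# Sample the RSS of the mik process once a minute and append it to a CSV.
# Path and interval are illustrative; adjust for your environment.
outfile=/var/log/mik-rss.csv
while pid=$(pgrep -o mik); do
  rss_kb=$(ps -o rss= -p "$pid")
  echo "$(date -Is),$pid,$rss_kb" >> "$outfile"
  sleep 60
done
echo "mik is no longer running" >&2
```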
Module Load Failures
Symptoms: 500 errors, “Failed to instantiate” in logs, 404 for module routes.

Diagnosis Steps:

1. Verify module file exists:

   ```bash
   ls -la /var/lib/mik/modules/*.wasm
   ```

2. Check file permissions:

   ```bash
   stat /var/lib/mik/modules/mymodule.wasm
   ```

3. Validate WASM format:

   ```bash
   file /var/lib/mik/modules/mymodule.wasm
   # Should say: WebAssembly (wasm) binary module
   ```

4. Check for AOT cache corruption:

   ```bash
   ls -la /var/lib/mik/modules/*.wasm.aot
   ```

5. Review logs for specific errors:

   ```bash
   journalctl -u mik | grep -i "instantiate\|module\|wasm" | tail -20
   ```
Resolution:
| Cause | Solution |
|---|---|
| File not found | Deploy missing module |
| Permission denied | chmod 644 module.wasm |
| Invalid WASM | Rebuild with mik build -rc |
| AOT cache stale | Delete .wasm.aot files, restart mik |
| wasmtime version mismatch | Rebuild AOT cache with current version |
Clear AOT cache:
```bash
rm /var/lib/mik/modules/*.wasm.aot
systemctl restart mik
```
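
When many modules are deployed, it can be faster to sweep the whole directory than to check files one by one. The loop below is a sketch built from the checks above (stat and file) against the default module path used in this guide:

```bash
#!/bin/bash
# Report any deployed module that is unreadable or not valid WASM.
# Uses the default module directory from this runbook.
for mod in /var/lib/mik/modules/*.wasm; do
  [ -e "$mod" ] || { echo "no modules found in /var/lib/mik/modules"; break; }
  if [ ! -r "$mod" ]; then
    echo "UNREADABLE: $mod ($(stat -c '%a %U:%G' "$mod"))"
  elif ! file "$mod" | grep -q "WebAssembly"; then
    echo "INVALID: $mod ($(file -b "$mod"))"
  else
    echo "OK: $mod"
  fi
done
```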
Authentication Failures

Symptoms: 401/403 responses, “auth_failure” in audit logs.
Diagnosis Steps:
1. Check if API key is configured:

   ```bash
   grep MIK_API_KEY /etc/systemd/system/mik.service
   ```

2. Verify the request includes the correct header:

   ```bash
   # Test with key
   curl -v -H "X-API-Key: your-key" http://localhost:9919/instances
   ```

3. Check audit logs:

   ```bash
   journalctl -u mik | jq 'select(.target == "audit")'
   ```
Resolution:
| Cause | Solution |
|---|---|
| Missing header | Add X-API-Key header to requests |
| Wrong key | Verify key matches MIK_API_KEY env var |
| Key not set | Set MIK_API_KEY in systemd service |
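
A quick way to tell a missing header from a wrong key is to compare status codes across three requests. This is a sketch; replace your-key with the value configured in MIK_API_KEY. A 401/403 for the first two and a 200 for the last suggests authentication is behaving as expected:

```bash
#!/bin/bash
# Compare admin API responses with no key, a bad key, and the configured key.
base=http://localhost:9919/instances
echo "no key:   $(curl -s -o /dev/null -w '%{http_code}' "$base")"
echo "bad key:  $(curl -s -o /dev/null -w '%{http_code}' -H 'X-API-Key: wrong' "$base")"
echo "real key: $(curl -s -o /dev/null -w '%{http_code}' -H 'X-API-Key: your-key' "$base")"
```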
Connection Refused
Symptoms: curl: (7) Failed to connect, service unreachable.

Diagnosis Steps:

1. Check if process is running:

   ```bash
   systemctl status mik
   pgrep -a mik
   ```

2. Check listening ports:

   ```bash
   ss -tlnp | grep -E "3000|9919"
   ```

3. Check bind address:

   ```bash
   grep -E "HOST|port" /etc/mik/mik.toml
   ```

4. Check firewall:

   ```bash
   iptables -L -n | grep -E "3000|9919"
   ```
Resolution:
| Cause | Solution |
|---|---|
| Process not running | systemctl start mik |
| Wrong port | Check mik.toml port setting |
| Bound to 127.0.0.1 | Set HOST=0.0.0.0 for external access |
| Firewall blocking | Open port in firewall rules |
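
The first two diagnosis steps can be rolled into a single pass over both ports. This sketch assumes the default HTTP port 3000 and admin port 9919 used elsewhere in this runbook:

```bash
#!/bin/bash
# Quick connectivity sweep for the mik HTTP (3000) and admin (9919) ports.
for port in 3000 9919; do
  if ss -tln | grep -q ":$port "; then
    echo "port $port: listening"
  else
    echo "port $port: NOT listening (check mik.toml and systemctl status mik)"
  fi
  curl -s -o /dev/null -m 2 "http://localhost:$port/health" \
    && echo "port $port: /health reachable" \
    || echo "port $port: /health unreachable"
done
```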
Circuit Breaker Tripped
Symptoms: 503 responses, “circuit-open” errors.

Diagnosis Steps:

1. Check circuit state:

   ```bash
   curl -s http://localhost:3000/metrics | grep circuit
   ```

2. Identify the failing module:

   ```bash
   journalctl -u mik | grep -i "circuit\|failure" | tail -20
   ```

3. Check downstream services:

   ```bash
   curl -s http://localhost:9919/health
   ```
Resolution:
The circuit breaker opens after 5 consecutive failures and recovers after 30 seconds.
- Wait for automatic recovery (30s)
- Fix the root cause (downstream service, bad input, etc.)
- Monitor for recurring issues
For testing, manually trigger a request to attempt recovery:
```bash
curl http://localhost:3000/run/failing-module/health
```
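
To confirm the breaker actually closes after the 30-second recovery window, the probe can be paired with the metrics check. This is a sketch assuming the mik_circuit_breaker_state metric from the High Latency section (where 1 means open) and the same example module route:

```bash
#!/bin/bash
# Probe the failing route once, then watch the circuit metric for up to a minute.
# 0 is taken to mean closed, per the High Latency section (1 = open).
curl -s -o /dev/null http://localhost:3000/run/failing-module/health
for i in $(seq 1 12); do
  state=$(curl -s http://localhost:3000/metrics \
    | awk '/^mik_circuit_breaker_state/ {print $2; exit}')
  echo "check $i: circuit_breaker_state=${state:-unknown}"
  [ "$state" = "0" ] && { echo "circuit closed"; exit 0; }
  sleep 5
done
echo "circuit still open after 60s; investigate the downstream cause" >&2
```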
Disk Space Issues

Symptoms: Write failures, “No space left on device” errors.
Diagnosis Steps:
1. Check disk usage:

   ```bash
   df -h /var/lib/mik
   ```

2. Find large files:

   ```bash
   du -sh /var/lib/mik/*
   du -sh /var/lib/mik/modules/*.aot
   ```

3. Check log size:

   ```bash
   du -sh /var/log/mik/
   ```
Resolution:
1. Clean the AOT cache:

   ```bash
   mik cache clean   # Remove stale entries
   # or
   mik cache clear   # Clear all
   ```

2. Rotate logs:

   ```bash
   journalctl --vacuum-size=100M
   ```

3. Remove old logs:

   ```bash
   find /var/log/mik -name "*.log.*" -mtime +7 -delete
   ```
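
If disk pressure keeps recurring, a small threshold check run from cron can catch it early. The 85% threshold and paths below are placeholders to tune for your environment:

```bash
#!/bin/bash
# Warn when the filesystem holding the module/AOT cache crosses a usage threshold.
threshold=85   # percent; adjust to taste
usage=$(df --output=pcent /var/lib/mik | tail -1 | tr -dc '0-9')
if [ "$usage" -ge "$threshold" ]; then
  echo "WARNING: /var/lib/mik at ${usage}% (threshold ${threshold}%)"
  du -sh /var/lib/mik/modules/*.aot 2>/dev/null | sort -rh | head -5
fi
```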
Health Checks
Endpoints

| Endpoint | Purpose | Expected Response |
|---|---|---|
| /health | Basic liveness | {"status": "ready"} |
| /metrics | Prometheus metrics | Prometheus text format |
Health Check Script
```bash
#!/bin/bash
set -e

# Check HTTP health
response=$(curl -sf http://localhost:3000/health)
status=$(echo "$response" | jq -r '.status')

if [ "$status" != "ready" ]; then
  echo "Health check failed: $response"
  exit 1
fi

echo "Health check passed"
exit 0
```
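
Outside Kubernetes, one way to use this script is a cron entry that records failures in the system journal; the install path and schedule below are illustrative:

```bash
# /etc/cron.d/mik-healthcheck (illustrative path and schedule)
# Run the health check every minute and record failures via logger.
* * * * * root /usr/local/bin/mik-healthcheck.sh || logger -t mik-healthcheck "health check failed"
```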
Kubernetes Probes

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
```
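
To confirm the probes are configured and passing on a running pod, standard kubectl commands are enough; the app=mik label selector below is a placeholder for your deployment's labels:

```bash
# Verify probe configuration and recent probe failures (label is illustrative).
kubectl describe pod -l app=mik | grep -A3 -E "Liveness|Readiness"
kubectl get events --field-selector reason=Unhealthy --sort-by=.lastTimestamp | tail -10
```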
Log Analysis

Finding Errors

```bash
# Recent errors
journalctl -u mik --since "1 hour ago" -p err

# JSON structured search
journalctl -u mik | jq 'select(.level == "ERROR")'

# Specific module errors
journalctl -u mik | jq 'select(.module == "auth" and .level == "ERROR")'
```

Performance Analysis
```bash
# Slow requests (>1s)
journalctl -u mik | jq 'select(.duration_ms > 1000)'

# Request distribution by module
journalctl -u mik | jq 'select(.message == "request_completed") | .module' | sort | uniq -c

# Error rate by endpoint
journalctl -u mik | jq 'select(.status >= 500) | .path' | sort | uniq -c | sort -rn
```
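
Building on the same structured fields (module, duration_ms, message), a short pipeline can summarize latency per module. Treat the field names as the ones shown above; this is a sketch, not a built-in report:

```bash
# Average and max duration per module for completed requests,
# using the same structured fields as the queries above.
journalctl -u mik \
  | jq -r 'select(.message == "request_completed") | "\(.module) \(.duration_ms)"' \
  | awk '{sum[$1]+=$2; n[$1]++; if ($2>max[$1]) max[$1]=$2}
         END {for (m in sum) printf "%-20s avg=%.0fms max=%.0fms count=%d\n", m, sum[m]/n[m], max[m], n[m]}'
```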
Trace Correlation

```bash
# Find all logs for a specific trace
journalctl -u mik | jq 'select(.span.trace_id == "abc123")'

# Find requests taking longer than expected
journalctl -u mik | jq 'select(.duration_ms > 500 and .message == "request_completed")'
```
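
The two queries can be combined to pull the full trace for every slow request. This sketch assumes the same span.trace_id and duration_ms fields:

```bash
# List the trace IDs of slow completed requests, then dump every log line
# belonging to each of those traces.
journalctl -u mik \
  | jq -r 'select(.duration_ms > 500 and .message == "request_completed") | .span.trace_id' \
  | sort -u \
  | while read -r tid; do
      echo "=== trace $tid ==="
      journalctl -u mik | jq --arg tid "$tid" 'select(.span.trace_id == $tid)'
    done
```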
Recovery Procedures

Full Service Restart

```bash
# Graceful restart
systemctl restart mik

# Force kill if unresponsive
systemctl kill -s SIGKILL mik
systemctl start mik
```
Cache Reset

```bash
# Stop service
systemctl stop mik

# Clear AOT cache
rm /var/lib/mik/modules/*.wasm.aot

# Clear module cache
mik cache clear

# Restart
systemctl start mik
```
Daemon State Reset

```bash
# Stop all instances
systemctl stop mik

# Reset daemon state
rm ~/.mik/state.redb

# Restart
systemctl start mik
```
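
If you want the three recovery procedures in one place, they can be wrapped in a single script. This is a sketch of the steps above, with the destructive state reset gated behind an explicit flag:

```bash
#!/bin/bash
# Full recovery: restart, clear caches, and (only with --reset-state) wipe daemon state.
set -e
systemctl stop mik
rm -f /var/lib/mik/modules/*.wasm.aot   # AOT cache
mik cache clear                         # module cache
if [ "$1" = "--reset-state" ]; then
  rm -f ~/.mik/state.redb               # daemon state (destructive)
fi
systemctl start mik
systemctl status mik --no-pager
```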
Escalation Checklist

Before escalating, ensure you have:
- Service status (systemctl status mik)
- Recent logs (last 100 lines)
- Health endpoint response
- Metrics snapshot
- System resources (CPU, memory, disk)
- Configuration file (mik.toml)
- Timestamp when issue started
- Any recent changes (deploys, config changes)
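
Most of these items can be collected automatically. The script below is a sketch that bundles them into a tarball for the escalation; output locations are placeholders:

```bash
#!/bin/bash
# Gather the escalation checklist items into a single tarball.
ts=$(date +%Y%m%d-%H%M%S)
dir=$(mktemp -d "/tmp/mik-escalation-$ts-XXXX")

systemctl status mik --no-pager          > "$dir/service-status.txt" 2>&1
journalctl -u mik -n 100 --no-pager      > "$dir/recent-logs.txt"    2>&1
curl -s http://localhost:3000/health     > "$dir/health.json"        2>&1
curl -s http://localhost:3000/metrics    > "$dir/metrics.txt"        2>&1
{ free -h; df -h; top -bn1 | head -20; } > "$dir/system.txt"         2>&1
cp /etc/mik/mik.toml "$dir/mik.toml" 2>/dev/null

tar -czf "/tmp/mik-escalation-$ts.tar.gz" -C "$dir" .
echo "Escalation bundle: /tmp/mik-escalation-$ts.tar.gz"
```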
Next Steps
- Monitoring & Observability - Set up metrics and alerts
- Production Deployment - Deployment best practices
- systemd Service Setup - Running as a system service