
Operations Runbook

This runbook provides step-by-step procedures for diagnosing and resolving common operational issues.

Before diving into specific issues, gather basic information:

Terminal window
# Check if mik is running
pgrep -a mik
# Check listening ports
ss -tlnp | grep mik
# View recent logs
journalctl -u mik --since "10 minutes ago" -f
# Check health endpoint
curl -s http://localhost:3000/health | jq
# Check metrics
curl -s http://localhost:3000/metrics | head -50

High Latency

Symptoms: Slow responses, P99 latency spikes, user complaints.

Diagnosis Steps:

  1. Check circuit breaker state in metrics:

    Terminal window
    curl -s http://localhost:3000/metrics | grep circuit_breaker

    If mik_circuit_breaker_state shows 1, the circuit is open.

  2. Check AOT cache hit ratio:

    Terminal window
    curl -s http://localhost:3000/metrics | grep cache

    Low hit ratio indicates frequent cold starts.

  3. Check for slow handlers:

    Terminal window
    journalctl -u mik | jq 'select(.duration_ms > 1000)' | head -20
  4. Check system resources:

    Terminal window
    top -p $(pgrep mik)
    iostat -x 1 5

Resolution:

Cause | Solution
Cold starts | Increase cache_size in mik.toml
Circuit breaker open | Check downstream services, wait for recovery
High CPU | Reduce max_concurrent_requests
Disk I/O | Move AOT cache to faster storage
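
After applying a fix, re-running the earlier diagnosis commands confirms it took effect, for example:

Terminal window
# Re-check breaker state (1 = open, per the diagnosis step above) and cache counters
curl -s http://localhost:3000/metrics | grep -E "circuit_breaker|cache"
# Confirm no new slow handlers
journalctl -u mik --since "5 minutes ago" | jq 'select(.duration_ms > 1000)'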

Out of Memory

Symptoms: Process killed, OOM messages in dmesg, “Killed” in logs.

Diagnosis Steps:

  1. Check memory metrics:

    Terminal window
    curl -s http://localhost:3000/metrics | grep memory
  2. Check system memory:

    Terminal window
    free -h
    dmesg | grep -i "out of memory"
  3. Review module cache settings:

    Terminal window
    grep -E "cache_size|max_cache_mb" /etc/mik/mik.toml

Resolution:

Reduce memory usage in mik.toml:

[server]
max_cache_mb = 128 # Reduce cache memory limit
cache_size = 50 # Fewer cached modules
max_concurrent_requests = 500 # Fewer simultaneous requests
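
Assuming mik reads mik.toml at startup, restart the service and check the memory metrics to confirm the lower limits hold:

Terminal window
systemctl restart mik
curl -s http://localhost:3000/metrics | grep memory
free -h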

For persistent issues:

  • Add swap space as a safety net (see the sketch after this list)
  • Increase container/VM memory
  • Split into multiple smaller instances
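
For the swap option, a minimal sketch of adding a 2 GB swap file on a typical Linux host (the size is illustrative; add it to /etc/fstab if it should persist across reboots):

Terminal window
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Verify the new swap is visible
free -h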

Module Loading Failures

Symptoms: 500 errors, “Failed to instantiate” in logs, 404 for module routes.

Diagnosis Steps:

  1. Verify module file exists:

    Terminal window
    ls -la /var/lib/mik/modules/*.wasm
  2. Check file permissions:

    Terminal window
    stat /var/lib/mik/modules/mymodule.wasm
  3. Validate WASM format:

    Terminal window
    file /var/lib/mik/modules/mymodule.wasm
    # Should say: WebAssembly (wasm) binary module
  4. Check for AOT cache corruption:

    Terminal window
    ls -la /var/lib/mik/modules/*.wasm.aot
  5. Review logs for specific errors:

    Terminal window
    journalctl -u mik | grep -i "instantiate\|module\|wasm" | tail -20

Resolution:

Cause | Solution
File not found | Deploy missing module
Permission denied | chmod 644 module.wasm
Invalid WASM | Rebuild with mik build -rc
AOT cache stale | Delete .wasm.aot files, restart mik
wasmtime version mismatch | Rebuild AOT cache with current version

Clear AOT cache:

Terminal window
rm /var/lib/mik/modules/*.wasm.aot
systemctl restart mik
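
If mik rebuilds the AOT cache lazily (an assumption; mymodule is a placeholder name), a warm-up request followed by a directory listing shows whether fresh .wasm.aot files were produced:

Terminal window
curl -s http://localhost:3000/run/mymodule/health
ls -la /var/lib/mik/modules/*.wasm.aot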

Authentication Failures

Symptoms: 401/403 responses, “auth_failure” in audit logs.

Diagnosis Steps:

  1. Check if API key is configured:

    Terminal window
    grep MIK_API_KEY /etc/systemd/system/mik.service
  2. Verify request includes correct header:

    Terminal window
    # Test with key
    curl -v -H "X-API-Key: your-key" http://localhost:9919/instances
  3. Check audit logs:

    Terminal window
    journalctl -u mik | jq 'select(.target == "audit")'

Resolution:

Cause | Solution
Missing header | Add X-API-Key header to requests
Wrong key | Verify key matches MIK_API_KEY env var
Key not set | Set MIK_API_KEY in systemd service
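
One way to set the key, assuming the service runs under systemd as shown above, is a drop-in override rather than editing the unit file in place:

Terminal window
systemctl edit mik
# In the editor that opens, add:
#   [Service]
#   Environment=MIK_API_KEY=your-key
systemctl daemon-reload
systemctl restart mik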

Connection Refused

Symptoms: curl: (7) Failed to connect, service unreachable.

Diagnosis Steps:

  1. Check if process is running:

    Terminal window
    systemctl status mik
    pgrep -a mik
  2. Check listening ports:

    Terminal window
    ss -tlnp | grep -E "3000|9919"
  3. Check bind address:

    Terminal window
    grep -E "HOST|port" /etc/mik/mik.toml
  4. Check firewall:

    Terminal window
    iptables -L -n | grep -E "3000|9919"

Resolution:

Cause | Solution
Process not running | systemctl start mik
Wrong port | Check mik.toml port setting
Bound to 127.0.0.1 | Set HOST=0.0.0.0 for external access
Firewall blocking | Open port in firewall rules
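
If the firewall is the culprit, the exact fix depends on the host's firewall manager; with plain iptables (matching the diagnosis step above) a sketch looks like:

Terminal window
iptables -I INPUT -p tcp --dport 3000 -j ACCEPT
iptables -I INPUT -p tcp --dport 9919 -j ACCEPT
# Re-test from a remote host against the server's address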

Circuit Breaker Open

Symptoms: 503 responses, “circuit-open” errors.

Diagnosis Steps:

  1. Check circuit state:

    Terminal window
    curl -s http://localhost:3000/metrics | grep circuit
  2. Identify failing module:

    Terminal window
    journalctl -u mik | grep -i "circuit\|failure" | tail -20
  3. Check downstream services:

    Terminal window
    curl -s http://localhost:9919/health

Resolution:

The circuit breaker opens after 5 consecutive failures and recovers after 30 seconds.

  1. Wait for automatic recovery (30s)
  2. Fix the root cause (downstream service, bad input, etc.)
  3. Monitor for recurring issues

For testing, manually trigger a request to attempt recovery:

Terminal window
curl http://localhost:3000/run/failing-module/health
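
To monitor for recurring trips (step 3 above), polling the metrics endpoint works well:

Terminal window
watch -n 5 'curl -s http://localhost:3000/metrics | grep circuit_breaker'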

Disk Full

Symptoms: Write failures, “No space left on device” errors.

Diagnosis Steps:

  1. Check disk usage:

    Terminal window
    df -h /var/lib/mik
  2. Find large files:

    Terminal window
    du -sh /var/lib/mik/*
    du -sh /var/lib/mik/modules/*.aot
  3. Check log size:

    Terminal window
    du -sh /var/log/mik/

Resolution:

  1. Clean AOT cache:

    Terminal window
    mik cache clean # Remove stale entries
    # or
    mik cache clear # Clear all
  2. Rotate logs:

    Terminal window
    journalctl --vacuum-size=100M
  3. Remove old logs:

    Terminal window
    find /var/log/mik -name "*.log.*" -mtime +7 -delete

Health Checks

Endpoint | Purpose | Expected Response
/health | Basic liveness | {"status": "ready"}
/metrics | Prometheus metrics | Prometheus text format

Health check script:

#!/bin/bash
set -e
# Check HTTP health
response=$(curl -sf http://localhost:3000/health)
status=$(echo "$response" | jq -r '.status')
if [ "$status" != "ready" ]; then
  echo "Health check failed: $response"
  exit 1
fi
echo "Health check passed"
exit 0

Kubernetes liveness and readiness probes:

livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5

Log Analysis

Terminal window
# Recent errors
journalctl -u mik --since "1 hour ago" -p err
# JSON structured search
journalctl -u mik | jq 'select(.level == "ERROR")'
# Specific module errors
journalctl -u mik | jq 'select(.module == "auth" and .level == "ERROR")'
Terminal window
# Slow requests (>1s)
journalctl -u mik | jq 'select(.duration_ms > 1000)'
# Request distribution by module
journalctl -u mik | jq 'select(.message == "request_completed") | .module' | sort | uniq -c
# Error rate by endpoint
journalctl -u mik | jq 'select(.status >= 500) | .path' | sort | uniq -c | sort -rn
Terminal window
# Find all logs for a specific trace
journalctl -u mik | jq 'select(.span.trace_id == "abc123")'
# Find requests taking longer than expected
journalctl -u mik | jq 'select(.duration_ms > 500 and .message == "request_completed")'

Restart Procedures

Terminal window
# Graceful restart
systemctl restart mik
# Force kill if unresponsive
systemctl kill -s SIGKILL mik
systemctl start mik
Terminal window
# Stop service
systemctl stop mik
# Clear AOT cache
rm /var/lib/mik/modules/*.wasm.aot
# Clear module cache
mik cache clear
# Restart
systemctl start mik
Terminal window
# Stop all instances
systemctl stop mik
# Reset daemon state
rm ~/.mik/state.redb
# Restart
systemctl start mik
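
Whichever restart procedure you use, confirm the service came back healthy before closing the incident:

Terminal window
systemctl status mik
curl -s http://localhost:3000/health | jq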

Escalation Checklist

Before escalating, ensure you have the following (the collection script after this list gathers most of it):

  • Service status (systemctl status mik)
  • Recent logs (last 100 lines)
  • Health endpoint response
  • Metrics snapshot
  • System resources (CPU, memory, disk)
  • Configuration file (mik.toml)
  • Timestamp when issue started
  • Any recent changes (deploys, config changes)
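
A rough sketch that collects most of the above into a single archive (paths and filenames are illustrative):

#!/bin/bash
# Illustrative collection script; adjust paths for your install
out="mik-escalation-$(date +%Y%m%dT%H%M%S)"
mkdir -p "$out"
systemctl status mik > "$out/service-status.txt" 2>&1
journalctl -u mik -n 100 > "$out/recent-logs.txt"
curl -s http://localhost:3000/health > "$out/health.json"
curl -s http://localhost:3000/metrics > "$out/metrics.txt"
{ date; uptime; free -h; df -h /var/lib/mik; } > "$out/resources.txt"
cp /etc/mik/mik.toml "$out/" 2>/dev/null || true
tar czf "$out.tar.gz" "$out"
echo "Created $out.tar.gz"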