Operations Runbook

Production operations procedures for Magic Runtime. Go-live, on-call, rollback, monitoring, and disaster recovery.

Go-Live Checklist

Pre-deployment

Create production .env from .env.example
Generate strong secrets (32+ chars) for JWT_SECRET and ADMIN_API_KEY
Set exact ALLOWED_ORIGINS (no wildcards)
Configure tight CONTROLLER_ALLOWLIST
Set ENVIRONMENT=prod
Configure TLS termination (load balancer, Cloudflare, or Nginx+certbot)

Generate secrets

openssl rand -hex 32  # For JWT_SECRET
openssl rand -hex 32  # For ADMIN_API_KEY
            

Deployment Steps

1. Download and verify

wget https://magic.threadsync.io/deploy/magic-runtime-2.0.0-production.tar.gz
wget https://magic.threadsync.io/deploy/SHA256SUMS
wget https://magic.threadsync.io/deploy/SHA256SUMS.sig
sha256sum -c SHA256SUMS --ignore-missing
gpg --verify SHA256SUMS.sig SHA256SUMS
            

2. Extract and configure

tar -xzf magic-runtime-2.0.0-production.tar.gz
cd magic-runtime
cp .env.example .env
chmod 600 .env
# Edit .env with production values
            

3. Deploy stack

docker compose up -d --build
./scripts/init_db.sh
            

4. Verify deployment

curl -s http://localhost/api/v1/health/live | jq .
curl -s http://localhost/api/v1/health/ready | jq .

# Test controller execution
curl -s -X POST http://localhost/api/v1/controllers/execute \
  -H 'Content-Type: application/json' \
  -H "X-API-Key: $ADMIN_API_KEY" \
  -d '{"controller": "DemoController", "message": "Hello, Magic!"}' | jq .
            

On-Call Checklist

Triage (first 2 minutes)

Check site: curl -sI https://your-domain.com/api/v1/health/live
Check containers: docker ps -a --format "table {{.Names}}\t{{.Status}}"
Check recent logs: docker logs magic-api --since 5m --tail 50
Classify severity (see table below)

Severity	Criteria	Response	Escalate After
P1	Site down, all requests failing	Immediate	15 min
P2	Error rate >5% or latency >5s	15 min	1 hour
P3	Resource threshold breached	Next business day	-

All-in-one status check

echo "=== Containers ===" && docker ps --format "{{.Names}}: {{.Status}}" && \
echo "=== API Health ===" && curl -s http://localhost/api/v1/health/live | jq . && \
echo "=== Readiness ===" && curl -s http://localhost/api/v1/health/ready | jq . && \
echo "=== Disk ===" && df -h / && \
echo "=== Memory ===" && free -h && \
echo "=== CPU ===" && uptime
            

Rollback Procedure

Quick rollback (< 2 minutes)

Application rollback

cd /opt/magic-runtime

# Stop current containers
docker compose down

# Restore previous docker-compose.yml (if modified)
cp docker-compose.yml.bak docker-compose.yml

# Start previous version
docker compose up -d

# Verify
curl -s http://localhost/api/v1/health/live | jq .
            

Controller rollback

Roll back a specific controller

# Via Magic CLI
docker compose exec runtime magic rollback <ControllerName> --to <previous-version>

# Verify the rollback
docker compose exec runtime magic status <ControllerName>
            

Full rollback (with database)

Database rollback is destructive

Only use full rollback if the deployment included database migrations that caused issues. This restores from backup and may lose data written after the backup.

Full rollback with database restore

# Stop services
docker compose down

# Restore database from backup
./scripts/restore.sh backups/magic_backup_YYYYMMDD_HHMMSS.sql

# Restore previous application version
cd /opt && tar -xzf magic-runtime-<previous-version>.tar.gz
cd magic-runtime

# Start previous version
docker compose up -d

# Verify
curl -s http://localhost/api/v1/health/live | jq .
curl -s http://localhost/api/v1/health/ready | jq .
            

Rollback checklist

Identify what changed (controller? config? migration?)
Choose rollback scope (controller-only, app-only, or full+DB)
Execute rollback steps above
Verify health + readiness endpoints
Verify existing controllers still execute
Notify team with incident summary
Create post-incident ticket

Monitoring & Alerts

Key Metrics

Metric	Warning	Critical
API Down	-	2 min
Error Rate	> 5%	> 10%
Response Time (p95)	> 2s	> 5s
CPU Usage	> 70%	> 90%
Memory Usage	> 70%	> 90%
Disk Usage	> 70%	> 85%

Incident Response

API Down

Check container status: docker ps -a
Check logs: docker logs magic-api --tail 100
Check database: docker exec magic-api python -c "from app.db.session import engine; engine.connect()"
Restart: docker compose down && docker compose up -d

High Error Rate

Check recent errors: docker logs magic-api --since 10m | grep ERROR
Identify controller: docker logs magic-api --since 10m | grep '"level":"error"' | jq '.controller'
Disable if needed: Remove from CONTROLLER_ALLOWLIST, then docker compose restart

Security Procedures

Audit Log Queries

Security audit commands

# Check capability violations
docker logs magic-api | grep '"event":"capability_denied"'

# Check auth failures
docker logs magic-api | grep '"event":"auth_failed"'

# Check admin actions
docker logs magic-api | grep '"event":"admin_action"'
            

Rotate Secrets

Generate new secrets: openssl rand -hex 32
Update .env
Restart: docker compose restart
Update any external API key references

Disaster Recovery

Target	Value
RTO (Recovery Time)	2 hours
RPO (Recovery Point)	24 hours (daily backups)

Restore to New Server

Provision new server (4GB+ RAM, Docker installed)
Copy backup + application tarball
Extract, configure .env, deploy
Restore database: ./scripts/restore.sh backups/<file>.sql
Update DNS / load balancer target
Verify all endpoints

Capacity Planning

Metric	Single Server Limit
Concurrent requests	~100-200
Controllers/sec	~20-50 (process sandbox)
Database connections	100 (default pool)

Scaling Options

Vertical: Upgrade server (4GB → 8GB → 16GB)
Horizontal: Add runtime replicas behind load balancer (runtime is stateless)
Database: Read replicas, connection pooling (PgBouncer)
Caching: Redis for controller result caching