No. The agent already has context on the generic application, infrastructure, and container alerts that engineering teams receive. Runbooks help in situations where you already have an opinion on how an issue should be debugged.
Writing runbooks is more about giving your own context/nuance to the agent. There are no hard and fast rules on how to write them, but here’s what we recommend:
Keep the instructions similar to how you’d give them to a junior engineer.
Add links to dashboards, log queries, or any other queries that remove the guesswork around schema or context.
Regarding access: if you have a specific way of accessing a certain file or piece of information, including it in the runbook may not be useful. The agent's access is defined by its setup, and it will likely not be able to request additional access from a user on the fly.
Runbooks can be mapped to an alert as a configuration so that the agent can leverage them to investigate that alert better. Attaching a runbook to an alert is a one-time activity; future alerts of the same pattern will use the attached runbook.
Alternatively, runbooks can be attached to the chat on the go, just like you attach files in Cursor for better context.
Runbooks created on the platform will be stored in DrDroid. Alternatively, you can connect DrDroid with Bitbucket, GitHub, or Confluence. In that case, your wiki is the source of truth and is synced to the platform on a recurring basis.
Explore these practical examples to get started with creating your own runbooks. Each example includes a description, use case, and the complete runbook definition.
While giving this runbook, I will mention the name of the service either in an alert or in the prompt. To check the health of the given service, these are the three things you should do:

1. Check the HTTP status by hitting the URL `https://service_name.exampleorg.dev/health`. Keep a timeout of 5s.
2. Check response times of the service in this Grafana dashboard: `grafana.exampleorg.com/d/uuid_dashboard_uid/`. If you see an anomaly, check the other data in the dashboard.
3. Check logs in Loki with the query `error service:service_name`.

Escalate to the service owner if you find any of them are off.
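For reference, here is a minimal sketch of what step 1 could look like as a concrete command. It is an illustration only: `checkout-service` is a hypothetical service name, and the success criterion (HTTP 200) is an assumption you may need to adjust for your health endpoint.

```bash
#!/usr/bin/env bash
# Hypothetical health-check sketch; "checkout-service" stands in for the
# service name supplied by the alert or prompt.
SERVICE_NAME="checkout-service"

# Step 1: hit the health endpoint with a 5-second timeout and capture the status code.
STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 \
  "https://${SERVICE_NAME}.exampleorg.dev/health")

if [ "$STATUS" != "200" ]; then
  # Anything other than 200 (including 000 on a timeout) is worth escalating.
  echo "Health check failed for ${SERVICE_NAME}: HTTP ${STATUS}"
fi
```

Steps 2 and 3 (Grafana and Loki) are best linked directly in the runbook, as recommended above, so the agent does not have to guess the dashboard or query syntax.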
When prompted with an alert or a request for database maintenance on a given DB, follow these steps:

1. **Check Active Connections**
   Connect to the Postgres instance and run:
   ```sql
   SELECT count(*) as active_connections FROM pg_stat_activity WHERE datname = '${DB_NAME}';
   ```
   - Replace `${DB_NAME}` with the actual database name.
   - If connections > `${MAX_CONNECTIONS}` (default: 100), investigate further.

2. **Run VACUUM ANALYZE**
   Execute the following command:
   ```sql
   VACUUM ANALYZE;
   ```
   - Reclaims storage and updates stats for the query planner.
   - Set a timeout of **1 hour**.

3. **Run REINDEX**
   Execute:
   ```sql
   REINDEX DATABASE ${DB_NAME};
   ```
   - This improves index scan performance.
   - Timeout: **2 hours**.

4. **Escalation**
   - If any operation fails or takes unusually long, escalate to the DB administrator.
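If you want a sense of how these steps translate into shell commands, here is a minimal sketch assuming `psql` access and GNU `timeout`. The host `db.internal`, user `maintenance`, and database `orders` are hypothetical placeholders; the runbook itself does not prescribe a particular client.

```bash
#!/usr/bin/env bash
# Hypothetical maintenance sketch; host, user, and DB name are placeholders.
DB_NAME="orders"
MAX_CONNECTIONS=100
PSQL="psql -h db.internal -U maintenance -d ${DB_NAME}"

# Step 1: count active connections and flag if above the threshold.
ACTIVE=$($PSQL -t -A -c "SELECT count(*) FROM pg_stat_activity WHERE datname = '${DB_NAME}';")
if [ "$ACTIVE" -gt "$MAX_CONNECTIONS" ]; then
  echo "High connection count (${ACTIVE} > ${MAX_CONNECTIONS}); investigate before maintenance."
fi

# Steps 2-3: run VACUUM ANALYZE (1h budget) and REINDEX (2h budget).
timeout 1h $PSQL -c "VACUUM ANALYZE;"
timeout 2h $PSQL -c "REINDEX DATABASE ${DB_NAME};"
```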
To deploy a new image version for a service, follow the steps below:

1. **Apply Deployment Manifest**
   Replace the `${IMAGE_TAG}` and `${NAMESPACE}` in the following:
   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: myapp
     namespace: ${NAMESPACE}
   spec:
     replicas: 3
     selector:
       matchLabels:
         app: myapp
     template:
       metadata:
         labels:
           app: myapp
       spec:
         containers:
           - name: myapp
             image: myregistry.com/myapp:${IMAGE_TAG}
             ports:
               - containerPort: 8080
   ```
   Run:
   ```
   kubectl apply -f deployment.yaml
   ```

2. **Verify Rollout Status**
   Check whether the new pods are running successfully:
   ```
   kubectl rollout status deployment/myapp -n ${NAMESPACE}
   ```
   - Timeout: **5 minutes**
   - If the rollout is stuck or fails, describe the pods and fetch logs:
   ```
   kubectl describe pod <pod-name>
   kubectl logs <pod-name>
   ```

3. **Escalation**
   - Escalate to DevOps or the owning team if rollout verification fails.
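As an illustration, the apply-and-verify flow could look like the sketch below. It assumes the manifest above is saved as `deployment.yaml` and uses `envsubst` to fill in the placeholders, which is just one way to do the substitution; the namespace and tag values are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical deployment sketch; NAMESPACE and IMAGE_TAG values are placeholders.
export NAMESPACE="production"
export IMAGE_TAG="v1.4.2"

# Step 1: substitute ${NAMESPACE} and ${IMAGE_TAG} in the manifest and apply it.
envsubst < deployment.yaml | kubectl apply -f -

# Step 2: wait up to 5 minutes for the rollout; on failure, collect pod details and logs.
if ! kubectl rollout status deployment/myapp -n "$NAMESPACE" --timeout=5m; then
  kubectl describe pods -n "$NAMESPACE" -l app=myapp
  kubectl logs -n "$NAMESPACE" -l app=myapp --tail=100
fi
```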
If CPU usage for the `api` service exceeds **90% for more than 5 minutes**, follow these steps:

1. **Identify Top Processes**
   SSH into `${AFFECTED_HOST}` and run:
   ```
   ps -eo pid,ppid,cmd,%cpu --sort=-%cpu | head -n 10
   ```
   - Note which processes are consuming the most CPU.

2. **Capture Java Thread Dump (if applicable)**
   If you find a Java process in the top list:
   ```
   jstack -l ${JAVA_PID} > /tmp/thread-dump-$(date +%s).txt
   ```
   - Replace `${JAVA_PID}` with the actual PID of the Java process.

3. **Scale the API Deployment (if needed)**
   If `api` is the main contributor to CPU:
   ```
   kubectl scale deployment api --replicas=$((CURRENT_REPLICAS * 2)) -n production
   ```
   - Replace `CURRENT_REPLICAS` with the current replica count.

4. **Alert Message**
   Prepare an alert message with the top processes included:
   ```
   High CPU usage detected. Investigation required.
   Top processes:
   <paste output from step 1>
   ```

5. **Escalation**
   - If CPU usage continues after scaling or the thread dump indicates deeper issues, escalate to the infra or backend team.
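A rough end-to-end sketch of steps 1-3 is shown below. It assumes SSH access to the affected host for the `ps`/`jstack` part and kubectl access for scaling; using `pgrep -f java` to locate the Java process is an assumption, and the scale step should only run once you are confident `api` is the culprit.

```bash
#!/usr/bin/env bash
# Hypothetical high-CPU investigation sketch; run steps 1-2 on ${AFFECTED_HOST}.

# Step 1: list the top CPU consumers.
ps -eo pid,ppid,cmd,%cpu --sort=-%cpu | head -n 10

# Step 2: capture a thread dump if a Java process is present (pgrep usage is an assumption).
JAVA_PID=$(pgrep -f java | head -n 1)
if [ -n "$JAVA_PID" ]; then
  jstack -l "$JAVA_PID" > "/tmp/thread-dump-$(date +%s).txt"
fi

# Step 3: double the api replicas if it is the main contributor.
CURRENT_REPLICAS=$(kubectl get deployment api -n production -o jsonpath='{.spec.replicas}')
kubectl scale deployment api --replicas=$((CURRENT_REPLICAS * 2)) -n production
```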