Runbooks are natural-language prompts that guide the agent while it investigates alerts.

Do I need runbooks for every alert?

No. The agent already has context for the generic application, infrastructure, and container alerts that engineering teams commonly receive. Runbooks help in situations where you already have an opinion on how an issue should be debugged.

How do I write runbooks?

Writing runbooks is mostly about giving the agent your own context and nuance. There are no hard and fast rules on how to write them, but here’s what we recommend:
  • Write the instructions the way you’d give them to a junior engineer.
  • Add links to dashboards, log queries, or any other query that removes the guesswork around schema or context.
  • Don’t spend effort describing your own way of accessing a particular file or piece of information; it is unlikely to help the agent. The agent’s access is defined at setup, and it will likely not be able to request access from a user on the go.

How can the runbooks be used?

Runbooks can be mapped to an alert as a configuration so that the agent can use them to investigate that alert more effectively. Mapping a runbook to an alert is a one-time activity; future alerts of the same pattern will pick up the same runbook. Alternatively, runbooks can be attached to the chat on the go, just like you attach files in Cursor for better context.

Do I need to configure or write code for automation?

No. The agent has been designed to automatically identify tools & data to query based on guidelines written in runbooks.

Where will the runbooks be stored?

Runbooks created on the platform will be stored on DrDroid. Alternatively, you can connect DrDroid with Bitbucket, GitHub, or Confluence. In this case, your wiki will be the source of truth and will be synced to the platform on a recurring basis.

Example Runbooks?

Explore these practical examples to get started with creating your own runbooks. Each example includes a description, use case, and the complete runbook definition.

1. Service Health Check

Use Case: Verify the health of a web service
When using this runbook, I will mention the name of the service either in an alert or in the prompt. To check the health of the given service, these are the 3 things that you should do:
1. Check the HTTP status by hitting the URL https://service_name.exampleorg.dev/health . Keep a timeout of 5s.
2. Check response times of the service in this Grafana dashboard: grafana.exampleorg.com/d/uuid_dashboard_uid/ . If you see an anomaly, check other data in the dashboard.
3. Check logs in Loki with the query `error service:service_name`
Escalate to the service owner if you find that any of them are off.
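
The agent does not need a script to follow this runbook. For readers who want to verify step 1 by hand, a minimal sketch might look like the following. The URL pattern and the 5s timeout come from the runbook; the function names and the choice to treat only 2xx codes as healthy are assumptions made for illustration.

```shell
# Hypothetical helpers illustrating step 1 of the runbook above.

# Build the health endpoint URL for a service name (pattern taken from the runbook).
health_url() {
  printf 'https://%s.exampleorg.dev/health\n' "$1"
}

# Print the HTTP status code of a URL, giving up after 5 seconds.
# curl reports 000 when no response was received (timeout, DNS failure, ...).
http_status() {
  curl --silent --output /dev/null --max-time 5 --write-out '%{http_code}' "$1"
}

# Return 0 when the status code is 2xx (assumed definition of "healthy").
is_healthy_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}
```

With a hypothetical service called `payments`, the check could be run as `is_healthy_status "$(http_status "$(health_url payments)")" || echo "escalate to service owner"`.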

2. Database Maintenance

Use Case: Perform routine database maintenance
When prompted with an alert or request for database maintenance on a given DB, follow these steps:

1. **Check Active Connections**  
   Connect to the Postgres instance and run:
   ```sql
   SELECT count(*) AS active_connections
   FROM pg_stat_activity
   WHERE datname = '${DB_NAME}';
   ```
   - Replace `${DB_NAME}` with the actual database name.  
   - If connections > `${MAX_CONNECTIONS}` (default: 100), investigate further.

2. **Run VACUUM ANALYZE**  
   Execute the following command:
   ```sql
   VACUUM ANALYZE;
   ```
   - Reclaims storage and updates stats for the query planner.  
   - Set a timeout of **1 hour**.

3. **Run REINDEX**  
   Execute:
   ```sql
   REINDEX DATABASE ${DB_NAME};
   ```
   - This improves index scan performance.  
   - Timeout: **2 hours**.

4. **Escalation**  
   - If any operation fails or takes unusually long, escalate to the DB administrator.
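
As an illustration of what these steps amount to when run by hand, here is a rough sketch. It assumes `psql` and the coreutils `timeout` command are available and that connection details come from the standard `PG*` environment variables; the function names are hypothetical and not part of DrDroid.

```shell
# Hypothetical helpers illustrating the maintenance steps above.

# Return 0 when the connection count is above the threshold (default 100, as in step 1).
connections_exceeded() {
  local count="$1" max="${2:-100}"
  [ "$count" -gt "$max" ]
}

# Step 1: count connections to the given database.
# The query is fed on stdin because psql does not interpolate variables passed via -c.
count_connections() {
  local db="$1"
  psql -d "$db" -tA -v db="$db" <<'SQL'
SELECT count(*) FROM pg_stat_activity WHERE datname = :'db';
SQL
}

# Steps 2 and 3: VACUUM ANALYZE with a 1 hour limit, then REINDEX with a 2 hour limit.
# Any failure or timeout returns non-zero so the caller can escalate (step 4).
run_maintenance() {
  local db="$1"
  timeout 1h psql -d "$db" -c 'VACUUM ANALYZE;' || return 1
  timeout 2h psql -d "$db" -c "REINDEX DATABASE \"$db\";" || return 1
}
```

For example, `connections_exceeded "$(count_connections mydb)" && echo "investigate further"` would flag a database (here the made-up name `mydb`) with more than 100 connections.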

3. Kubernetes Deployment Rollout

Use Case: Deploy a new version of a service
To deploy a new image version for a service, follow the steps below:

1. **Apply Deployment Manifest**  
   Replace `${IMAGE_TAG}` and `${NAMESPACE}` in the following manifest and save it as `deployment.yaml`:
   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: myapp
     namespace: ${NAMESPACE}
   spec:
     replicas: 3
     selector:
       matchLabels:
         app: myapp
     template:
       metadata:
         labels:
           app: myapp
       spec:
         containers:
         - name: myapp
           image: myregistry.com/myapp:${IMAGE_TAG}
           ports:
           - containerPort: 8080
   ```
   - Run:
     ```shell
     kubectl apply -f deployment.yaml
     ```

2. **Verify Rollout Status**  
   Check whether the new pods are running successfully:
   ```shell
   kubectl rollout status deployment/myapp -n ${NAMESPACE}
   ```
   - Timeout: **5 minutes**
   - If the rollout is stuck or fails, describe the pods and fetch logs:
     ```shell
     kubectl describe pod <pod-name> -n ${NAMESPACE}
     kubectl logs <pod-name> -n ${NAMESPACE}
     ```

3. **Escalation**  
   - Escalate to DevOps or the owning team if rollout verification fails.
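
The placeholder substitution and the 5 minute limit can be sketched as a small script. This is only one possible way to do it: the function names are hypothetical, the substitution uses plain `sed` (so tags containing `|` or `&` would need escaping), and the manifest is assumed to be saved as `deployment.yaml` with the placeholders still in place.

```shell
# Hypothetical helpers illustrating the rollout steps above.

# Read a manifest template on stdin and fill in ${IMAGE_TAG} and ${NAMESPACE}.
render_manifest() {
  local image_tag="$1" namespace="$2"
  sed -e 's|[$]{IMAGE_TAG}|'"$image_tag"'|g' \
      -e 's|[$]{NAMESPACE}|'"$namespace"'|g'
}

# Apply the rendered manifest, then wait up to 5 minutes for the rollout.
# On failure, print pod details and recent logs, and return non-zero so the
# caller can escalate.
deploy() {
  local image_tag="$1" namespace="$2"
  render_manifest "$image_tag" "$namespace" < deployment.yaml \
    | kubectl apply -f - || return 1
  if ! kubectl rollout status deployment/myapp -n "$namespace" --timeout=5m; then
    kubectl describe pods -l app=myapp -n "$namespace"
    kubectl logs -l app=myapp -n "$namespace" --tail=50
    return 1
  fi
}
```

A call such as `deploy v1.2.3 staging` (both values made up here) would then cover steps 1 and 2; a non-zero exit status corresponds to the escalation in step 3.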

4. Incident Response – High CPU Usage

Use Case: Automated response to high CPU usage
If CPU usage for the `api` service exceeds **90% for more than 5 minutes**, follow these steps:

1. **Identify Top Processes**  
   SSH into `${AFFECTED_HOST}` and run:
   ```shell
   ps -eo pid,ppid,cmd,%cpu --sort=-%cpu | head -n 10
   ```
   - Note which processes are consuming the most CPU.

2. **Capture Java Thread Dump (if applicable)**  
   If you find a Java process in the top list:
   ```shell
   jstack -l ${JAVA_PID} > /tmp/thread-dump-$(date +%s).txt
   ```
   - Replace `${JAVA_PID}` with the actual PID of the Java process.

3. **Scale the API Deployment (if needed)**  
   If `api` is the main contributor to CPU:
   ```shell
   kubectl scale deployment api --replicas=$((CURRENT_REPLICAS * 2)) -n production
   ```
   - Replace `CURRENT_REPLICAS` with the current replica count.

4. **Alert Message**  
   Prepare an alert message with top processes included:
   ```text
   High CPU usage detected. Investigation required.
   Top processes:
   <paste output from step 1>
   ```

5. **Escalation**  
   - If high CPU usage persists after scaling, or the thread dump indicates deeper issues, escalate to the infra or backend team.
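
To make the scaling arithmetic and the alert message concrete, the following sketch shows one way the pieces could fit together. The helper names are hypothetical; reading the current replica count with `kubectl get ... -o jsonpath` is an assumption about how you would obtain `CURRENT_REPLICAS`, not something the runbook prescribes.

```shell
# Hypothetical helpers illustrating steps 1, 3 and 4 above.

# Step 1: list the header line plus the top CPU consumers (Linux procps syntax).
top_processes() {
  ps -eo pid,ppid,cmd,%cpu --sort=-%cpu | head -n 10
}

# Read the current replica count of the api deployment.
current_replicas() {
  kubectl get deployment api -n production -o jsonpath='{.spec.replicas}'
}

# Step 3: compute the doubled replica count.
doubled_replicas() {
  echo $(( $1 * 2 ))
}

# Step 4: build the alert message with the process list embedded.
alert_message() {
  printf 'High CPU usage detected. Investigation required.\nTop processes:\n%s\n' "$1"
}
```

Scaling would then be `kubectl scale deployment api --replicas="$(doubled_replicas "$(current_replicas)")" -n production`, and the message body `alert_message "$(top_processes)"`.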