Auto RCA

Get recommendations on investigations within your stack

Introduction

Auto RCA is an LLM-assisted capability from Doctor Droid that analyses your alert message, reviews the context of your existing stack, and creates an investigation strategy that is likely to help you reach the root cause of an issue.

How does it work?

Auto RCA uses your existing infrastructure and data in the following manner:

  1. The Doctor Droid Slack bot notices a new alert in your Slack channel.

v0 -- [Only Slack]:

  1. Our model runs an analysis on your past alerts and infrastructure composition (detected from past alerts).
  2. You receive a reply in the thread with recommended steps to find the root cause or fix the issue.

v1 -- [Slack + Data Sources]:

In addition to the steps in v0:

  1. Our bot will also run queries based on our model's pre-determined investigation strategy (which you will be able to see). These could include deployment checks, metric queries or log queries. Our platform currently does not recommend automatic database queries.
  2. It will interpret the output and recommend steps to fix the issue or investigate further.
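
The v1 flow above can be pictured as an ordered list of checks that are run one after another, with their outputs collected for interpretation. The following is only a minimal sketch under stated assumptions, not Doctor Droid's implementation: `Step`, `ALLOWED_KINDS` and `run_strategy` are hypothetical names, the lambdas stand in for real data-source queries, and skipping database steps is one possible reading of the note about automatic database queries.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical: the kinds of checks named in the text above.
ALLOWED_KINDS = {"deployment", "metric", "log"}


@dataclass
class Step:
    kind: str                 # e.g. "deployment", "metric", "log"
    description: str          # shown to the user as part of the strategy
    run: Callable[[], str]    # stand-in for a real data-source query


def run_strategy(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run each step in order and collect (description, output) pairs."""
    results = []
    for step in steps:
        if step.kind not in ALLOWED_KINDS:
            # Assumption: anything outside the allowed kinds (such as a
            # database query) is left out of the automated run.
            results.append((step.description, "skipped"))
            continue
        results.append((step.description, step.run()))
    return results


# Illustrative strategy with made-up outputs.
strategy = [
    Step("deployment", "Deployments in the last hour", lambda: "1 deployment: checkout v42"),
    Step("metric", "p99 latency for checkout", lambda: "p99 rose from 200ms to 900ms"),
    Step("db", "Row count in orders table", lambda: "not executed"),
]
report = run_strategy(strategy)
```

Keeping the strategy as plain data makes it easy to show the planned steps to the user before anything is executed.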

v2 -- [Slack + Data Sources + Custom Playbooks]:

In addition to the steps in v1:

  1. Our bot will check whether any of your existing playbooks is configured for that alert or is suitable for it. If so, it will run that playbook.
  2. It will interpret the output and recommend steps to fix the issue or investigate further.
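
One way to think about the playbook check in v2 is as a two-stage lookup: first an explicit mapping from alert to playbook, then a fallback suitability score. The sketch below is purely illustrative. `Playbook`, `select_playbook` and the keyword-overlap score are assumptions made for this example; the actual suitability logic is not described in this document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Playbook:
    name: str
    alert_names: List[str] = field(default_factory=list)  # explicitly configured alerts
    keywords: List[str] = field(default_factory=list)     # hypothetical suitability hints


def select_playbook(alert_name: str, alert_text: str,
                    playbooks: List[Playbook]) -> Optional[Playbook]:
    """Return a configured playbook if one exists, else the best keyword match."""
    # Stage 1: a playbook explicitly configured for this alert wins.
    for pb in playbooks:
        if alert_name in pb.alert_names:
            return pb
    # Stage 2: fall back to the playbook with the most keyword hits.
    text = alert_text.lower()
    best, best_score = None, 0
    for pb in playbooks:
        score = sum(1 for kw in pb.keywords if kw.lower() in text)
        if score > best_score:
            best, best_score = pb, score
    return best  # None means no playbook is configured or suitable


playbooks = [
    Playbook("db-latency", alert_names=["HighDBLatency"], keywords=["database", "latency"]),
    Playbook("pod-restarts", keywords=["crashloop", "restart"]),
]
```

When `select_playbook` returns `None`, the flow would fall back to the v1 behaviour described earlier.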

What does our model analyse?

Our model can currently:

  1. Traverse a trace to identify the specific failing component and other potentially buggy components
  2. Understand anomalies in any metric or log search
  3. Correlate deployments with anomalies
  4. Understand how widely adopted infrastructure components function and what their different metrics mean
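
To make the first capability more concrete, here is a small sketch of one common heuristic for trace traversal: an errored span whose children did not error is a likely origin of the failure, because the error did not propagate up from anything deeper. The `Span` shape and `failing_origins` function are hypothetical and do not reflect how the model actually reads traces.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    error: bool


def failing_origins(spans: List[Span]) -> List[str]:
    """Return services of errored spans that have no errored child span."""
    # Index children by parent so each span's children can be inspected.
    children: Dict[str, List[Span]] = {}
    for s in spans:
        if s.parent_id is not None:
            children.setdefault(s.parent_id, []).append(s)

    origins = []
    for s in spans:
        has_errored_child = any(c.error for c in children.get(s.span_id, []))
        if s.error and not has_errored_child:
            origins.append(s.service)
    return origins


# Made-up trace: the error surfaces at the gateway but starts in payments.
trace = [
    Span("a", None, "api-gateway", True),
    Span("b", "a", "checkout", True),
    Span("c", "b", "payments", True),
    Span("d", "b", "inventory", False),
]
print(failing_origins(trace))  # ['payments']
```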

Future:

Interpretation:

In the near future, the bot will also start interpreting your logs and metrics, helping automate parts of the investigation.
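
As a rough illustration of what interpreting a metric could involve, the sketch below flags points that deviate strongly from a trailing baseline and then lists deployments that happened shortly before the anomaly. This is a generic z-score approach chosen for the example, not a description of the bot's method; `find_anomalies`, `deployments_before`, the thresholds and the sample data are all assumptions.

```python
from statistics import mean, stdev
from typing import List, Tuple


def find_anomalies(values: List[float], threshold: float = 3.0,
                   baseline: int = 10) -> List[int]:
    """Indices whose value is more than `threshold` standard deviations
    away from the mean of the preceding `baseline` points."""
    anomalies = []
    for i in range(baseline, len(values)):
        window = values[i - baseline:i]
        mu, sigma = mean(window), stdev(window)
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def deployments_before(anomaly_ts: int, deployments: List[Tuple[str, int]],
                       window_s: int = 900) -> List[str]:
    """Names of deployments that finished within `window_s` seconds before the anomaly."""
    return [name for name, ts in deployments if 0 <= anomaly_ts - ts <= window_s]


# Made-up latency samples (ms), one per minute starting at t = 10_000 s.
latency = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 400]
timestamps = [10_000 + 60 * i for i in range(len(latency))]
deployments = [("checkout v42", 10_480), ("search v7", 2_000)]

for idx in find_anomalies(latency):
    suspects = deployments_before(timestamps[idx], deployments)
    print(timestamps[idx], suspects)
```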

Actions:

Depending on whether you have set up an automatic response to this result, the bot can also modify your system to rectify the issue. These actions include, but are not limited to:

  1. Restarting or rolling back a deployment
  2. Running a data-fixing query on a DB
  3. Creating a pull request by analysing the error stack trace from the code
  4. Changing a configuration to disable a feature that is causing the breakage
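
The gating described above, where an action is taken only if you have set up an automatic response, can be sketched as a simple lookup. Everything here is hypothetical: `respond`, the diagnosis labels and the placeholder action are invented for illustration, and a real action would call your deployment or configuration tooling.

```python
from typing import Callable, Dict


def respond(diagnosis: str, auto_responses: Dict[str, Callable[[], str]]) -> str:
    """Run the configured action for a diagnosis, or fall back to a recommendation."""
    action = auto_responses.get(diagnosis)
    if action is None:
        # No automatic response configured: nothing in the system is modified.
        return f"recommendation only: {diagnosis}"
    return action()


# Hypothetical configuration: only bad deployments are handled automatically.
auto_responses = {
    "bad_deployment": lambda: "rolled back deployment",  # placeholder for a real rollback
}
```

With this shape, the default is always a recommendation, and each automated action has to be opted into explicitly.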

Getting started

To get started with Auto RCA, install our Slack bot in your workspace.