Agent Droid

Scale yourself with an agent that investigates production incidents and resolves them like you would

Agent Droid is an AI-based tool that investigates production incidents, finds the root cause in a matter of seconds, and recommends action items to fix the incident swiftly. It can also automate remediation actions based on the investigation results.

How does it work?

Agent Droid sits on top of an engineering team's APM, cloud, logs and deployment stack and can do the following:

  1. It notices a new alert in your Slack, PagerDuty or email workspace and decides where to look.
  2. It runs a series of queries against the upstream, downstream and other dependencies it has linked to the service that raised the alert, and finds which component is faulty and what caused it.
  3. It identifies the recent change that could have caused the faulty behaviour. This could be any of the following:
    1. Bad deployment
    2. External or downstream service behaving incorrectly
    3. Unexpected customer behavior
    4. Infrastructure issues (DB, Redis, broker, etc.)
    5. Configuration change
  4. It automatically posts the root cause in your workspace, next to the alert, within 10 seconds of the alert being reported.
  5. If you have set up an automatic response for this result, it can also modify your system to rectify the issue. Supported actions include, but are not limited to:
    1. Restarting or rolling back a deployment
    2. Running a data-fixing query on a DB
    3. Creating a pull request by analysing the error stack trace from the code
    4. Changing a configuration to disable the feature that is causing the breakage
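
The flow above, from a classified root cause to an optional automatic response, can be sketched roughly as follows. This is only an illustration under stated assumptions and not Agent Droid's actual implementation: the names RootCause, Finding, REMEDIATIONS and plan_response are hypothetical, and the pairing of causes with actions is an assumed example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Set


class RootCause(Enum):
    """The five change categories listed in step 3."""
    BAD_DEPLOYMENT = "bad deployment"
    EXTERNAL_SERVICE = "external or downstream service"
    CUSTOMER_BEHAVIOR = "unexpected customer behavior"
    INFRASTRUCTURE = "infrastructure issue"
    CONFIG_CHANGE = "configuration change"


@dataclass
class Finding:
    """Hypothetical result of an investigation."""
    service: str
    cause: RootCause
    summary: str


# Assumed pairing of causes with remediation actions, for illustration only.
REMEDIATIONS: Dict[RootCause, str] = {
    RootCause.BAD_DEPLOYMENT: "roll back deployment",
    RootCause.CONFIG_CHANGE: "disable feature via configuration",
}


def plan_response(finding: Finding, auto_enabled: Set[RootCause]) -> dict:
    """Build the report and decide whether an action may run automatically.

    An action is automatic only if one is known for the cause AND the
    user has opted in to automation for that cause.
    """
    action: Optional[str] = REMEDIATIONS.get(finding.cause)
    automatic = action is not None and finding.cause in auto_enabled
    return {
        "report": f"{finding.service}: {finding.cause.value} ({finding.summary})",
        "action": action,
        "automatic": automatic,
    }
```

In this sketch, a finding whose cause has no known action, or whose cause the user has not opted in, produces only a report. That mirrors the idea that remediation runs only when an automatic response has been set up.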

What access will be needed?

Agent Droid needs access to every data source that you would query during an investigation. The deeper the integration, the more effective its results will be. It can currently query the following:

  1. AWS, GCP, Azure - For cloud metrics
  2. Datadog, New Relic, Prometheus - For service-level metrics
  3. Elasticsearch, CloudWatch, Google Cloud Logging - For querying logs
  4. GitHub Actions, Jenkins, Harness, CircleCI - For identifying the latest deployments of different services
  5. Slack channel - For reading past alerts to learn from them, and for receiving the alerts that trigger an investigation
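
Since results depend on how many of these categories are connected, a quick way to reason about coverage is to check which categories have no connected source. The snippet below is a minimal sketch, assuming a hypothetical SUPPORTED table and coverage_gaps helper; it is not part of any Agent Droid API.

```python
# Hypothetical table of the categories and tools listed above.
SUPPORTED = {
    "cloud metrics": {"AWS", "GCP", "Azure"},
    "service metrics": {"Datadog", "New Relic", "Prometheus"},
    "logs": {"Elasticsearch", "CloudWatch", "Google Cloud Logging"},
    "deployments": {"GitHub Actions", "Jenkins", "Harness", "CircleCI"},
    "alerts": {"Slack"},
}


def coverage_gaps(connected):
    """Return, in alphabetical order, the categories with no connected tool."""
    connected = set(connected)
    return sorted(
        category
        for category, tools in SUPPORTED.items()
        if not (tools & connected)
    )
```

For example, a team that has connected only AWS, Datadog and Slack would see logs and deployments reported as gaps.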

How do I sign up?

Agent Droid is currently in beta, with limited coverage of the tools it understands and the remedial actions it can perform. It is evolving every week with new releases. For first-hand access, sign up here. If you want to discuss this with our team, reach out to our CTO or set up a discovery call with him directly.