Alert Insights GA + Cancelling Playbooks

The last two weeks have seen a lot of development on the alert insights bot. Here's what we delivered:

  • Native UI: Transitioned from Streamlit to Native UI for the alert insights dashboards.
  • Upgraded backend: Productised the entire insight generation algorithm.

Together, these two changes have reduced the Time-To-Value from 4-12 hours to under 5 minutes.

  • Added intelligence: Auto-surface relevant insights by analysing the shape of the alerts data enriched for a team.
  • New Integrations: Google Chat Spaces (for alerts), Sentry (for data enrichment)

This helps users get actionable insights without having to search for them.

We've decided not to go ahead with the playbooks feature in its current form.

Why? Most production troubleshooting is exploratory data analysis (EDA). Asking users to codify their EDAs is non-trivial and can become a significant bottleneck to adoption. Instead, we are doing the following:

  • Added intelligence: Frontloading the bot with intelligent auto-diagnosis algorithms that operate with ZERO upfront effort.
  • API-first data fetching: Unless power users explicitly request otherwise, we are sticking to observability tools with stable data APIs. We have made progress on Datadog APIs, New Relic NerdGraph and AWS CloudWatch metrics (via Boto3). Early adopters will be able to get instant L1 diagnoses in their Slack channels and/or observability tools.
  • Notification-led delivery: Auto-responding to Slack alerts remains our highest-priority channel for delivering these diagnoses. Sending diagnoses to Datadog via their Events API is also LIVE now.
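
For readers curious what posting a diagnosis to Datadog might look like, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not our production code: the helper names, the tag values and the sample alert are hypothetical, and the endpoint shown is the Datadog Events API v1 URL for the US1 site (other Datadog sites use a different host). The request is built but not sent.

```python
import json
import urllib.request

# Datadog Events API v1 endpoint for the US1 site (assumption: adjust the host for other sites).
DATADOG_EVENTS_URL = "https://api.datadoghq.com/api/v1/events"


def build_diagnosis_event(alert_name, diagnosis, team):
    """Build the JSON body for a Datadog event carrying an L1 diagnosis.

    alert_name, diagnosis and team are hypothetical inputs for illustration.
    """
    return {
        "title": f"L1 diagnosis: {alert_name}",
        "text": diagnosis,
        "alert_type": "info",
        # Hypothetical tags; use whatever tagging scheme your team follows.
        "tags": [f"team:{team}", "source:alert-insights"],
    }


def build_request(event, api_key):
    """Wrap the event in an HTTP POST request without sending it."""
    return urllib.request.Request(
        DATADOG_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )


event = build_diagnosis_event(
    "High 5xx rate on checkout",
    "Error spike correlates with a deploy 10 minutes before the alert.",
    "payments",
)
request = build_request(event, api_key="<your-api-key>")
# To actually send the event: urllib.request.urlopen(request)
```

Keeping payload construction separate from the network call makes the event shape easy to inspect and test before any key is configured.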