Motivation
Businesses that depend on software running in real time to generate revenue mostly rely on engineers to debug any deviations in business metrics or customer-facing issues. This forces engineers to switch context frequently and reduces the time they can spend on new features. Doctor Droid is solving this pain by adding an auto-triaging layer that connects business outcomes to events generated from within the code. Think of it as correlating application logs & metrics with business success.
Solving this pain point will let engineers do what they do best, i.e. build. Companies often hire dedicated tech support teams to triage customer-facing issues by tracing them back to the data flowing through their applications. We are automating that.
We used to work at a fast-paced logistics company that specialises in food and grocery deliveries in under 30 minutes. Its operations were spread across hundreds of merchants, hundreds of cities & their operations teams, and thousands of delivery partners. The platform was built using more than 50 micro-services, and there were many points of failure that could break in specific scenarios for different stakeholders. We found that triaging issues at a delivery partner, merchant or order level was extremely hard without a mental map that connects the questions to the answers. Rarely were we told “API x is down”. Most often, we would face questions like:
- Why is delivery partner A not able to upload their documents?
- Why are we receiving 14% fewer orders from New Delhi today than on the same day last week?
- How much should we improve our delivery partner allocation algorithm to increase orders by 5%?
Only senior engineers who had built those products could debug these, and that created a bottleneck in both tech support and new development. These questions can only be answered by complex analysis of data from different sources, captured through different lenses for the same product, sometimes involving multiple teams. We solved it back then by writing custom scripts that could proactively detect such issues before they were reported, so that the work did not have to be done on demand. We also hired junior engineers specifically to follow a process & check 5 different parts of our observability and monitoring stack to find the root cause of each issue.
There should be a simpler & faster way to do this.
And that’s what we are working on.
Rooting for a more productive future,
Product Status:
We have started building this by launching a simple stateful monitoring tool.
- As an engineer, you can either pass events to Doctor Droid using our APIs or SDK or connect an existing event source like ElasticSearch, Segment or Cloudwatch with us.
- Using these events, you can create a stateful monitor on the relative behaviour between the events. The relation we support today is the time taken between them. For the same entity, like an order or a payment, you can see how far apart these events happen or whether some of them are not happening at all. This way, you can create a periodically generated report that describes this behaviour.
- This stateful monitor has many applications for tracking the impact of software on critical business processes like signup using OTP, payment refunds, food order delivery, etc.
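To make the idea of a stateful monitor concrete, here is a minimal sketch in Python of the one relation described above: the time taken between two events for the same entity. It is an illustration only and does not use Doctor Droid's API or SDK. The `Event` class, the `monitor_gap` function, the event names and the 60-second threshold are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape: a name, the entity it belongs to, and a time."""
    name: str
    entity_id: str
    timestamp: datetime


def monitor_gap(
    events: List[Event],
    start_name: str,
    end_name: str,
    max_gap: timedelta,
) -> Dict[str, List[str]]:
    """Classify each entity by the delay between its start and end events.

    "ok"      : end event arrived within max_gap of the start event
    "slow"    : end event arrived, but later than max_gap
    "missing" : start event seen, end event never seen
    """
    starts: Dict[str, datetime] = {}
    ends: Dict[str, datetime] = {}
    # Process events in time order and keep the first start and the first
    # end that follows it for every entity.
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.name == start_name:
            starts.setdefault(event.entity_id, event.timestamp)
        elif event.name == end_name and event.entity_id in starts:
            ends.setdefault(event.entity_id, event.timestamp)

    report: Dict[str, List[str]] = {"ok": [], "slow": [], "missing": []}
    for entity_id, started_at in starts.items():
        ended_at = ends.get(entity_id)
        if ended_at is None:
            report["missing"].append(entity_id)
        elif ended_at - started_at > max_gap:
            report["slow"].append(entity_id)
        else:
            report["ok"].append(entity_id)
    return report


# Example: signup using OTP, with an assumed 60-second expectation.
t0 = datetime(2024, 1, 1, 12, 0, 0)
sample_events = [
    Event("otp_requested", "user-1", t0),
    Event("otp_requested", "user-2", t0),
    Event("otp_requested", "user-3", t0),
    Event("otp_verified", "user-1", t0 + timedelta(seconds=20)),
    Event("otp_verified", "user-2", t0 + timedelta(minutes=5)),
]
sample_report = monitor_gap(
    sample_events, "otp_requested", "otp_verified", timedelta(seconds=60)
)
```

In this sketch, `user-1` lands in `ok`, `user-2` in `slow`, and `user-3` in `missing`. Running the same function on a schedule over a recent window of events would approximate the periodic report described above.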
We have a lot more lined up, and you can read about it here.