AI-Powered Incident Investigation

Key results

  • 70–80% faster root-cause identification
  • 20–30 min per incident investigation
  • 50–60 hrs engineering time saved monthly
  • 75–80% incidents with full analysis coverage

Summary

A large e-commerce retailer running a Magento platform with 500K+ products was losing hours of senior engineering time to manual incident investigation.
Every incident required navigating multiple dashboards, correlating logs across services, and reconstructing timelines by hand before they could even start fixing the issue.
WiserBrand built an AI investigation agent that mirrors how experienced engineers approach incidents. It maps affected services, connects signals across systems, and delivers a structured investigation in minutes, so teams reach decisions faster without the manual overhead.
Cooperation Period: Ongoing
Location: USA
Industry: E-commerce
Technology Stack: PydanticAI, Claude, Elasticsearch, Zipkin, Qdrant

Business Challenge

  • Investigation, not fixing, was the bottleneck

    The infrastructure spanned ~30 servers and multiple interconnected systems. Engineers spent 2–4 hours understanding incidents before they could fix anything.

  • Heavy reliance on senior engineers

    Investigation required deep system knowledge and was handled mostly by senior engineers, limiting scalability and slowing response during peak load.

  • No consistent way to reconstruct incidents

    Teams manually stitched timelines across logs with different formats and timestamps, leading to inconsistent analysis and uneven post-mortems.

What We Did

We built an AI agent that mirrors how experienced engineers investigate incidents, removing the manual work that slows them down.

  1. Mapped the dependency graph

    Before building the agent, we spent two months documenting how services connected, how logs were structured across systems, and where trace data lived. This became the foundation the agent uses to identify blast radius from any alert signal.

  2. Built the multi-agent investigation core

    The agent starts from an incoming alert, identifies affected services via dependency mapping, queries logs and distributed traces across systems, prioritizes signals by error rate and timing, and reconstructs the full event sequence. The reasoning loop mirrors how a senior engineer would approach the same problem manually.

  3. Designed structured investigation output

    Each run produces a review-ready report: probable root cause with a confidence score, cross-service event timeline, direct links to supporting logs, and suggested remediation steps. The output is built for validation, not autonomous action.

  4. Tested and refined on real incidents

    We used 80 hours of post-launch work to run the agent against historical incidents, identify gaps in coverage, and tune behavior for edge cases including third-party failures and incomplete trace data.
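The dependency mapping in step 1 is what lets the agent turn a single alert into a blast radius. A minimal sketch of that lookup, assuming a hypothetical service graph with invented service names (the real map was built from two months of documentation work):

```python
from collections import deque

# Hypothetical dependency map: service -> services that depend on it.
# The production graph was derived from documenting ~30 servers.
DEPENDENTS = {
    "payments": ["checkout", "order-service"],
    "checkout": ["storefront"],
    "order-service": ["fulfillment"],
    "storefront": [],
    "fulfillment": [],
}

def blast_radius(alerting_service: str) -> set[str]:
    """Walk the dependency graph breadth-first to find every service
    that could be affected when the alerting service degrades."""
    seen = {alerting_service}
    queue = deque([alerting_service])
    while queue:
        service = queue.popleft()
        for dependent in DEPENDENTS.get(service, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

From this set the agent knows which services' logs and traces to pull, instead of querying everything.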
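Step 2's reasoning loop reduces to two operations on collected signals: ranking them by severity and origin, and ordering them in time. A simplified sketch under the assumption that signals have already been fetched (in production they come from Elasticsearch and Zipkin; the `Signal` shape here is invented):

```python
from dataclasses import dataclass

@dataclass
class Signal:
    service: str
    timestamp: float   # seconds since the alert window opened
    error_rate: float  # errors per minute observed at that point

def prioritize(signals: list[Signal]) -> list[Signal]:
    """Rank signals for investigation: highest error rate first,
    earlier timestamps (closer to the likely origin) breaking ties."""
    return sorted(signals, key=lambda s: (-s.error_rate, s.timestamp))

def reconstruct_timeline(signals: list[Signal]) -> list[str]:
    """Order all signals chronologically into a cross-service event sequence."""
    return [
        f"+{s.timestamp:.0f}s {s.service}: error_rate={s.error_rate}/min"
        for s in sorted(signals, key=lambda s: s.timestamp)
    ]
```

Prioritization decides what the agent investigates first; the timeline is what ends up in the report.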
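Step 3's report maps naturally onto a typed schema, and since the stack uses PydanticAI, a Pydantic model is the natural shape. The field names below are illustrative, not the production schema:

```python
from pydantic import BaseModel, Field

class TimelineEvent(BaseModel):
    timestamp: str      # ISO-8601, normalized across differing log formats
    service: str
    description: str

class InvestigationReport(BaseModel):
    """Review-ready output: built for engineer validation, not autonomous action."""
    probable_root_cause: str
    confidence: float = Field(ge=0.0, le=1.0)  # root-cause confidence score
    timeline: list[TimelineEvent]
    supporting_log_links: list[str]            # direct links to supporting logs
    remediation_steps: list[str]
```

A typed schema is what makes every post-mortem consistent: the model either validates or the run is flagged, so no report ships with a missing timeline or an out-of-range confidence score.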

Project Results

The agent cut investigation time from 2–4 hours to under 30 minutes for the large majority of incidents, with measurable impact on both engineering capacity and recovery speed.
  • 20–30 min to a validated root-cause hypothesis
  • 50–60 hrs engineering time saved monthly
  • 75–80% incidents with full analysis coverage
  • Consistent post-mortems for every incident