AI-Powered Incident Investigation Case Study

Summary

A large e-commerce retailer running a Magento platform with 500K+ products was losing hours of senior engineering time to manual incident investigation.

Every incident required navigating multiple dashboards, correlating logs across services, and reconstructing timelines by hand before engineers could even start fixing the issue.

WiserBrand built an AI investigation agent that mirrors how experienced engineers approach incidents. It maps affected services, connects signals across systems, and delivers a structured investigation in minutes, so teams reach decisions faster without the manual overhead.

Cooperation Period

Ongoing

Location

USA

Industry

E-commerce

Technology Stack

PydanticAI

Claude

Elasticsearch

Zipkin

Qdrant

Business Challenge

Investigation, not fixing, was the bottleneck
The infrastructure spanned ~30 servers and multiple interconnected systems. Engineers spent 2–4 hours understanding incidents before they could fix anything.
Heavy reliance on senior engineers
Investigation required deep system knowledge and was handled mostly by senior engineers, limiting scalability and slowing response during peak load.
No consistent way to reconstruct incidents
Teams manually stitched timelines across logs with different formats and timestamps, leading to inconsistent analysis and uneven post-mortems.
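To illustrate why stitching timelines by hand is slow and error-prone, here is a minimal sketch of normalizing log timestamps from several services into one UTC-ordered timeline. The service names, messages, and the three timestamp formats are hypothetical examples, not the client's actual log formats.

```python
from datetime import datetime, timezone

# Hypothetical log entries: (service, raw timestamp, message).
# Each service uses a different timestamp convention (assumed for illustration).
RAW_EVENTS = [
    ("checkout", "2024-03-01T12:00:05+00:00", "payment timeout"),  # ISO 8601
    ("nginx", "01/Mar/2024:12:00:03 +0000", "upstream 502"),       # access-log style
    ("db", "1709294401", "connection pool exhausted"),             # Unix epoch seconds
]


def parse_timestamp(raw: str) -> datetime:
    """Convert a raw timestamp in one of the assumed formats to aware UTC."""
    if raw.isdigit():
        # Epoch seconds are already UTC-based.
        return datetime.fromtimestamp(int(raw), tz=timezone.utc)
    try:
        # Access-log style, e.g. 01/Mar/2024:12:00:03 +0000
        parsed = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        # Fall back to ISO 8601 with an explicit offset.
        parsed = datetime.fromisoformat(raw)
    return parsed.astimezone(timezone.utc)


def build_timeline(events):
    """Return (utc_time, service, message) tuples sorted chronologically."""
    normalized = [(parse_timestamp(ts), svc, msg) for svc, ts, msg in events]
    return sorted(normalized, key=lambda item: item[0])


if __name__ == "__main__":
    for when, service, message in build_timeline(RAW_EVENTS):
        print(when.isoformat(), service, message)
```

Once every entry carries a timezone-aware UTC timestamp, ordering events across services becomes a plain sort rather than a manual comparison.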

What We Did

We built an AI agent that mirrors how experienced engineers investigate incidents, removing the manual work that slows them down.

1. Mapped the dependency graph
Before building the agent, we spent two months documenting how services connected, how logs were structured across systems, and where trace data lived. This became the foundation the agent uses to identify the blast radius from any alert signal.
2. Built the multi-agent investigation core
The agent starts from an incoming alert, identifies affected services via dependency mapping, queries logs and distributed traces across systems, prioritizes signals by error rate and timing, and reconstructs the full event sequence. The reasoning loop mirrors how a senior engineer would approach the same problem manually.
3. Designed structured investigation output
Each run produces a review-ready report: probable root cause with a confidence score, cross-service event timeline, direct links to supporting logs, and suggested remediation steps. The output is built for validation, not autonomous action.
4. Tested and refined on real incidents
We spent 80 hours after launch running the agent against historical incidents, identifying gaps in coverage, and tuning behavior for edge cases, including third-party failures and incomplete trace data.
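As a rough illustration of two of the steps described above, the sketch below walks a dependency map to find the blast radius of an alerting service and then orders signals by error rate and timing. The service names, the `DEPENDS_ON` map, the `SIGNALS` records, and both function names are hypothetical; the production agent's data model and ranking logic are not described in this case study.

```python
from collections import deque

# Hypothetical dependency map: service -> services it calls.
DEPENDS_ON = {
    "storefront": ["catalog", "checkout"],
    "checkout": ["payments", "inventory"],
    "catalog": ["search"],
    "search": [],
    "payments": [],
    "inventory": [],
}

# Hypothetical signals: error rate (0..1) and seconds after the alert fired.
SIGNALS = [
    {"service": "storefront", "error_rate": 0.05, "first_seen": 40},
    {"service": "checkout", "error_rate": 0.42, "first_seen": 15},
    {"service": "payments", "error_rate": 0.42, "first_seen": 0},
]


def blast_radius(alerting, depends_on):
    """Return the alerting service plus every service that depends on it,
    directly or transitively, using a breadth-first walk."""
    # Invert the map so we can move from a service to its callers.
    callers = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            callers.setdefault(dependency, []).append(service)

    affected = {alerting}
    queue = deque([alerting])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


def rank_signals(signals):
    """Order signals by highest error rate, then by earliest occurrence."""
    return sorted(signals, key=lambda s: (-s["error_rate"], s["first_seen"]))


if __name__ == "__main__":
    print(sorted(blast_radius("payments", DEPENDS_ON)))
    for signal in rank_signals(SIGNALS):
        print(signal["service"], signal["error_rate"], signal["first_seen"])
```

Breaking ties by earliest occurrence is one plausible heuristic: when two services fail at a similar rate, the one that failed first is more likely to be upstream of the other.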

Project Results

The agent reduced investigation time from 2–4 hours to under 30 minutes for the large majority of incidents, with measurable impact on both engineering capacity and recovery speed.

20–30 min to a validated root-cause hypothesis

50–60 hrs of engineering time saved monthly

75–80% of incidents with full analysis coverage

Consistent post-mortems for every incident
