Log analysis
Quickly identify the root cause of an incident by analyzing voluminous and heterogeneous logs (application, infra, network).
During a production incident, log analysis is one of the most time-consuming phases: hopping between Kibana, CloudWatch, and Datadog, hunting for abnormal patterns, correlating events across services. AI saves precious time when every minute counts (SLA breaches, degraded UX, business loss). Used well, it can cut MTTR in half or better. The challenge: never letting AI suggestions substitute for an experienced operator's judgment. This guide presents the AI-assisted incident workflow and the pitfalls to avoid under pressure.
Step-by-step workflow
Collect relevant logs
Identify the incident time window and the services involved, then export the logs filtered by severity (error/warn/critical). Limit the export to a manageable volume (10-50k lines max) for efficient AI analysis.
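A minimal collection sketch in Python, assuming JSON-lines logs with `timestamp` (ISO 8601) and `level` fields; the window, file paths, and field names are placeholders to adapt to your stack:

```python
import json
from datetime import datetime, timezone

# Placeholders — set these to the incident's actual window and your log paths.
WINDOW_START = datetime(2026, 1, 15, 14, 0, tzinfo=timezone.utc)
WINDOW_END = datetime(2026, 1, 15, 15, 30, tzinfo=timezone.utc)
KEEP_LEVELS = {"ERROR", "WARN", "CRITICAL"}
MAX_LINES = 50_000  # keep the extract manageable for AI analysis

def collect(src_path: str, dst_path: str) -> int:
    """Keep only in-window error/warn/critical lines from a JSON-lines log."""
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for raw in src:
            try:
                event = json.loads(raw)
                ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
            except (json.JSONDecodeError, KeyError, ValueError):
                continue  # skip lines that don't parse
            if WINDOW_START <= ts <= WINDOW_END and event.get("level", "").upper() in KEEP_LEVELS:
                dst.write(raw)
                kept += 1
                if kept >= MAX_LINES:
                    break
    return kept

print(f"{collect('app.log', 'incident_extract.log')} lines kept")
```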
Pseudonymize sensitive elements
Before sending anything to an AI: remove or mask tokens, secrets, internal IPs, user identifiers, and personal data. Non-negotiable, even in an emergency.
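A starting-point redaction sketch; the patterns are illustrative, not exhaustive — audit a sample of your logs and extend the list with whatever they actually contain:

```python
import re

# Illustrative patterns only — extend with the formats present in your logs.
PATTERNS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),                      # IPv4 addresses
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "<EMAIL>"),                  # email addresses
    (re.compile(r"(?i)bearer\s+[a-z0-9._~+/=-]+"), "Bearer <TOKEN>"),          # bearer tokens
    (re.compile(r"(?i)(api[_-]?key|secret|password)\s*[=:]\s*\S+"), r"\1=<REDACTED>"),
    (re.compile(r"(?i)\buser[_-]?id\s*[=:]\s*\S+"), "user_id=<USER>"),         # user identifiers
]

def pseudonymize(line: str) -> str:
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line

with open("incident_extract.log") as src, open("incident_safe.log", "w") as dst:
    for line in src:
        dst.write(pseudonymize(line))
```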
Submit with incident context
Describe the incident to the AI: observed symptoms, impacted services, recent changes (deployments, config updates), and the start time. The richer the context, the more targeted the analysis.
Request structured analysis
Not 'what's happening?' but 'identify: (1) the probable root error, (2) the propagation chronology, (3) alternative hypotheses, (4) diagnostic commands to run'. A structured format accelerates decision-making.
Validate with targeted tests
Never act on AI analysis alone. Run diagnostic commands (ping, curl, kubectl describe, etc.) to confirm the hypothesis before taking any corrective action.
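A sketch of a small confirmation harness; the commands, URL, and `checkout` namespace are hypothetical stand-ins for whatever diagnostics the AI's analysis suggests:

```python
import subprocess

# Hypothetical checks derived from the AI's hypotheses — replace with your own.
CHECKS = [
    ("health endpoint", ["curl", "-sf", "-o", "/dev/null", "https://checkout.internal/health"]),
    ("pod status", ["kubectl", "get", "pods", "-n", "checkout"]),
    ("deployment events", ["kubectl", "describe", "deployment", "checkout", "-n", "checkout"]),
]

for label, cmd in CHECKS:
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=15)
        status = "OK" if result.returncode == 0 else f"FAIL rc={result.returncode}"
        detail = "" if result.returncode == 0 else (result.stderr or result.stdout).strip()
    except subprocess.TimeoutExpired:
        status, detail = "TIMEOUT", ""
    print(f"[{status}] {label}" + (f"\n{detail}" if detail else ""))
```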
Copyable prompts
Two tested and optimized prompts. Adapt the bracketed [VARIABLE] placeholders to your context.
Incident analysis from logs
You're a senior SRE. I have a production incident:

**Symptoms**: [DESCRIPTION — e.g., 5xx errors increasing, degraded latency, service down]
**Impacted services**: [LIST]
**Start time**: [TIMESTAMP]
**Recent modifications**: [DEPLOYMENTS, CONFIGS, INFRA]
**Logs (pseudonymized)**: [PASTE LOGS]

Produce:
1. **Probable root cause**: what broke first?
2. **Chronology**: which event triggered what (with timestamps where visible)
3. **Alternative hypotheses**: 2-3 other leads to investigate
4. **Diagnostic commands** to run immediately to confirm or refute
5. **Quick corrective action** (workaround), if possible without additional risk
6. **Deep corrective action** to plan post-incident

Stay honest: if you're unsure, say so. Do not invent plausible but unsourced causes.
Datadog/Kibana query generation
For this investigation question: [QUESTION — e.g., 'find slow requests (>2s) on /api/checkout in the last 24h, grouped by user agent']

Generate a query for [DATADOG / KIBANA / SPLUNK / CLOUDWATCH INSIGHTS]. Provide:
1. The complete query, ready to paste
2. An explanation of the fields and operators used
3. Useful variants
4. Performance pitfalls to avoid (full scan vs. indexed fields)
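For illustration, roughly what such a generated query can look like when run through the Elasticsearch Python client — the ECS field names (`url.path`, `event.duration` in nanoseconds, `user_agent.original`), endpoint, and index pattern are assumptions to adjust to your mappings:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Filter context (no scoring) keeps this cheap; leading wildcards or script
# filters would force a full scan instead of using the index.
resp = es.search(
    index="logs-*",
    size=0,  # we only want the aggregation, not the documents
    query={
        "bool": {
            "filter": [
                {"term": {"url.path": "/api/checkout"}},
                {"range": {"event.duration": {"gt": 2_000_000_000}}},  # >2s in ns
                {"range": {"@timestamp": {"gte": "now-24h"}}},
            ]
        }
    },
    aggs={"by_user_agent": {"terms": {"field": "user_agent.original", "size": 10}}},
)
for bucket in resp["aggregations"]["by_user_agent"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```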
Top tools for this use case
A curated selection of the three best AI tools for log analysis.
Claude
Why for this use case: the best reasoning over complex causal chains ('error A → cascade B → impact C'), and precise when proposing alternative hypotheses.
Claude Code
Why for this use case: analysis with full project context: it can read local logs, runbooks, and Dockerfiles. Ideal for deep investigation.
ChatGPT
Why for this use case: Code Interpreter is excellent for parsing voluminous logs (>1M lines), aggregating them, and visualizing patterns in seconds.
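As an illustration of that kind of aggregation, a short pandas sketch that surfaces the first error spike per service — the `timestamp`, `level`, and `service` column names are assumptions about the log schema:

```python
import pandas as pd

# Count ERROR/CRITICAL lines per service per minute; the first column to spike
# often points at the root service rather than its downstream victims.
df = pd.read_json("incident_extract.log", lines=True)
df["timestamp"] = pd.to_datetime(df["timestamp"])
errors = df[df["level"].isin(["ERROR", "CRITICAL"])]
spikes = (
    errors.set_index("timestamp")
    .groupby("service")
    .resample("1min")
    .size()
    .unstack(level=0, fill_value=0)
)
print(spikes.head(20))
```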
Estimated ROI
Time saved
50-60% MTTR reduction on complex incidents
Quality gain
Systematic hypotheses, clear chronologies, richer post-mortems
Stack cost
Included in Claude Pro / ChatGPT Plus subscriptions
Estimates based on 2026 benchmarks and user feedback. Actual ROI depends on your context.
Frequently asked questions
Can production logs be sent to a public LLM?
Not as-is: you risk leaking secrets, tokens, and personal data. Solutions: (1) systematically pseudonymize before sending, (2) use Claude for Work / ChatGPT Enterprise (no training on your data), (3) self-host (Ollama, vLLM) for the most sensitive data.
Can AI really find a root cause?
On 'classic' incidents (misconfiguration, failed deployment, expired certificate, OOM): often yes, within minutes. On subtle incidents (race conditions, distributed bugs, silent corruption): it proposes leads, but human expertise remains central. An assistant, not an oracle.
How do you integrate AI into an observability platform?
Several approaches: (1) Datadog and New Relic ship integrated AI copilots, (2) use MCP (Model Context Protocol) to plug Claude into your log sources, (3) write a custom script that pulls logs and calls an LLM API (see the sketch below). A 2026 standard is emerging around MCP.
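A minimal sketch of approach (3), using the official `anthropic` Python SDK on the already-pseudonymized extract; the model name, file path, and truncation limit are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("incident_safe.log") as f:
    logs = f.read()[:200_000]  # naive truncation to stay within the context window

prompt = (
    "You're a senior SRE. Analyze these pseudonymized production logs and produce: "
    "(1) probable root cause, (2) propagation chronology, (3) alternative hypotheses, "
    "(4) diagnostic commands to confirm or refute.\n\nLogs:\n" + logs
)

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder — use whichever model you have
    max_tokens=2048,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)
```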