Incident Management for Teams Without a Dedicated SRE: A Practical Guide

Most SRE content assumes you have a team of SREs. Here's how to build a real incident management process when you're a small team wearing every hat.

Rob

Founder, Strake

9 min read

Most incident management advice assumes you have a real SRE function already in place. Dedicated rotations, formal roles, long severity docs, postmortem templates with twelve sections. That advice is useful in the right environment. It just doesn't map especially well to a smaller team where the CTO, the senior backend engineer, and the person who shipped the last deploy are all effectively part of the incident process.

If you're running with a lean engineering team and no dedicated SRE, the goal isn't sophistication. The goal is clarity. When something breaks, you want three things to be true: you notice quickly, the right person knows what to do next, and the team fixes the underlying issue often enough that the same incident doesn't keep resurfacing.

That's the version of incident management that actually helps when your current process is still "someone posts in Slack and we figure it out from there."

What You Actually Need (vs. What SRE Content Tells You You Need)

For a small team, the list is shorter than people make it sound.

What actually matters:

  1. You need to know something is broken before a customer tells you. That means basic monitoring and alerting. Nothing fancy, just reliable enough that you are not learning about outages from support tickets. I wrote more about that in "Your Startup Doesn't Need Better Monitoring."

  2. You need a simple response path. Who gets paged, what they check first, where the incident lives, and when they pull in help. That can fit on one page.

  3. You need a lightweight habit of learning from incidents. Not a heavy postmortem ceremony. Just enough follow-through that the same issue doesn't bite you for the fourth time.

Everything else is secondary. SLOs, error budgets, review boards, chaos exercises, and the rest can be useful later. They are not the first thing standing between you and a workable incident process.

Building Your Incident Response Process From Scratch

Start with three severity levels, not five

I've found that three severity levels are enough for most small teams. More than that usually creates debate without improving the response.

P1 — the product is down or a core workflow is broken for everyone. Someone gets paged immediately. You stay on it until service is back, and if customers are clearly affected, you communicate early instead of waiting for a perfect explanation.

P2 — something important is degraded, but the product still basically works. Maybe a major feature is unstable, or a subset of users is having a bad time. This should get attention quickly, but it usually does not justify waking someone up overnight unless the business impact is unusually high.

P3 — something is wrong, but it can wait. A background job is failing, a dashboard is stale, or a non-critical dependency is acting up. This becomes a ticket, not a page.

The real value here is not the wording. It's the discipline behind it. A P1 should mean "wake someone up." A P3 should mean "nobody loses sleep." Once teams blur those lines, alert fatigue shows up fast.
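To make that discipline concrete, the three levels can be encoded as one small routing rule. This is just a sketch: `page`, `notify`, and `ticket` are hypothetical stand-ins for your paging tool, Slack webhook, and issue tracker.

```python
# Sketch: map the three severity levels to concrete actions.
# page(), notify(), and ticket() are hypothetical stand-ins for
# a paging tool, a Slack webhook, and an issue tracker.

def page(msg: str) -> None:
    print(f"PAGE on-call: {msg}")      # wakes someone up

def notify(msg: str) -> None:
    print(f"NOTIFY channel: {msg}")    # seen within the hour

def ticket(msg: str) -> None:
    print(f"TICKET created: {msg}")    # handled during a workday

SEVERITY_ACTIONS = {
    "P1": page,    # product down: wake someone up, now
    "P2": notify,  # degraded: quick attention, no 3am page
    "P3": ticket,  # can wait: nobody loses sleep
}

def route(severity: str, summary: str) -> None:
    SEVERITY_ACTIONS[severity](f"[{severity}] {summary}")

route("P1", "API returning 500s for all users")
```

The point of writing it down like this is that the mapping is explicit: if a P3 ever calls `page`, someone changed the rules, not just the wording.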

Set up your on-call rotation

Three engineers is the minimum rotation I've seen hold up for more than a few weeks. With two people, someone is on-call every other week and starts to dread the whole thing. With three, it's still not luxurious, but it's survivable.

On tooling, this is one place where I would spend the money. Use PagerDuty or Opsgenie. Don't build a homemade paging system around Slack, calendars, and someone's phone settings. Alert routing at 3am is a solved problem, and solved problems are worth buying.

Create your war room protocol

When a real incident starts, you need a predictable place for it to live.

  1. The on-call engineer opens a Slack channel like #inc-2026-03-23-api-errors.
  2. They post the current state right away, even if the update is just "seeing elevated 500s, investigating."
  3. If they are still stuck after 10-15 minutes, they pull in the person closest to the affected system.
  4. They keep posting short updates on a fixed rhythm. Fifteen minutes is usually enough.
  5. When the incident is over, they leave behind a short summary of what happened, what fixed it, and what follow-up work is needed.

That is usually enough. Small teams do not need to invent every formal incident role they have seen in enterprise playbooks. If three people are involved, one of them can keep the channel updated while working the problem.

Build your first 10 runbooks

"Runbook" makes this sound heavier than it is. For a small team, a runbook is just a checklist for a known failure mode.

Start with the ten things that have already hurt you. For most startups, the list looks roughly like this:

  1. API returning 5xx errors
  2. Database connection failures
  3. High response latency
  4. Background job queue backed up
  5. Third-party API dependency down
  6. SSL certificate expired (yes, this still happens)
  7. Disk full
  8. Deploy broke something
  9. DNS issues
  10. Authentication/login broken

Each runbook only needs three things: what to check first, how to mitigate, and who to pull in if the first pass doesn't work. In practice that means a few dashboards, a few commands, a rollback or restart path, and a clear escalation point.

The standard I like is simple: could a reasonably capable engineer follow this at 3am while half-awake and either stabilize the system or know exactly who to call next? If yes, the runbook is doing its job.
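One way to keep that three-part structure honest is to treat each runbook as structured data rather than free-form prose. A sketch, using a hypothetical "disk full" runbook as the example:

```python
from dataclasses import dataclass

@dataclass
class Runbook:
    """A runbook is just a checklist: check first, mitigate, escalate."""
    failure_mode: str
    check_first: list[str]   # dashboards and commands to look at
    mitigate: list[str]      # rollback / restart / cleanup paths
    escalate_to: str         # who to pull in if the first pass fails

# Hypothetical example for failure mode #7 on the list above.
disk_full = Runbook(
    failure_mode="Disk full",
    check_first=[
        "df -h on the affected host",
        "du -sh /var/log/* to find the biggest offenders",
    ],
    mitigate=[
        "Rotate or truncate the largest log files",
        "If it's a data volume, expand the disk via the cloud console",
    ],
    escalate_to="infra owner, if the disk refills within an hour",
)
```

If a runbook can't be filled into those three fields, that's usually a sign it's trying to document a mystery rather than a known failure mode.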

The On-Call Rotation Reality for Small Teams

On-call at a startup is never going to feel glamorous, and pretending otherwise usually makes it worse.

Compensation matters. That can mean extra PTO, comp time, a monthly stipend, or some combination. The exact mechanism matters less than the signal that on-call work is real work. If someone gets dragged out of bed at 3am and is still expected to operate like nothing happened at 9am, resentment builds quickly.

Set a sane page budget. As a rule of thumb, outside-business-hours pages should be rare. If people are getting woken up multiple times a week, either the alerts are too noisy or the system is genuinely unstable. Both are fixable engineering problems.

Ease people into the rotation. New engineers should shadow first, then serve as backup, then take primary. On-call is stressful enough without making someone learn your systems and your incident process at the same time.

Runbooks lower the emotional cost. Most people can handle being paged occasionally. What really spikes the stress is waking up and feeling like there is no map. A decent runbook doesn't remove the pressure, but it changes the experience from "solve a mystery in the dark" to "work through a checklist and escalate if needed."

What to Track and Why

You do not need a massive reliability dashboard. Four numbers will tell you most of what you need to know.

Time to detect (TTD). How long does it take from breakage to awareness? If customers usually tell you first, your alerting is not doing its job.

Time to resolve (TTR). How long does it take from the first alert to a verified fix in production? This is the number that tells you whether incidents are annoying or truly expensive.

Incident frequency by service. Which part of the system keeps paging you? That is where your reliability work should go first, even if another problem feels more interesting technically.

Repeat incidents. What keeps coming back? This one is painful, but useful. Recurring incidents usually mean you only treated the symptom last time.

You can track all of this in a spreadsheet. The tool doesn't matter much. The habit does.
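Even the spreadsheet version reduces to a few aggregations. Here's a sketch over made-up rows; the column layout and timestamps are assumptions, not a schema you need to adopt.

```python
from collections import Counter
from datetime import datetime

# Each row: (service, started, detected, resolved) — made-up sample data.
incidents = [
    ("api",   datetime(2026, 3, 1, 2, 0),   datetime(2026, 3, 1, 2, 9),    datetime(2026, 3, 1, 3, 0)),
    ("api",   datetime(2026, 3, 8, 14, 0),  datetime(2026, 3, 8, 14, 4),   datetime(2026, 3, 8, 14, 40)),
    ("queue", datetime(2026, 3, 12, 9, 0),  datetime(2026, 3, 12, 10, 30), datetime(2026, 3, 12, 11, 0)),
]

# TTD: breakage to awareness. TTR: first alert to verified fix.
ttd = [(det - start).total_seconds() / 60 for _, start, det, _ in incidents]
ttr = [(res - det).total_seconds() / 60 for _, _, det, res in incidents]
by_service = Counter(service for service, *_ in incidents)

print(f"avg TTD: {sum(ttd) / len(ttd):.0f} min")
print(f"avg TTR: {sum(ttr) / len(ttr):.0f} min")
print(f"incidents by service: {dict(by_service)}")
```

In this sample, the 90-minute detection gap on the queue incident is exactly the kind of thing the numbers surface and a gut feeling misses.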

When to Hire a Dedicated SRE

Usually later than you think.

Here are the signals that it may actually be time to bring in dedicated SRE help:

  • A big chunk of engineering time is disappearing into operational work. If incident response, infra maintenance, deploy babysitting, and general firefighting are eating the team alive, the opportunity cost becomes real.

  • The on-call burden is consistently high. If engineers are getting paged constantly and the causes are infra-heavy rather than straightforward application bugs, that is often a sign that reliability needs more dedicated ownership.

  • You now have contractual uptime expectations. Once you are selling into larger customers with SLA language, uptime reporting, and incident expectations, someone needs to own that discipline full-time.

  • The system has outgrown shared context. When no one person can explain the major moving parts with confidence, the risk profile changes.

  • The team is large enough that coordination itself is becoming the problem. At some point, the process needs an owner even if the tech stack is still manageable.

Until then, the more immediate win is usually better operational visibility. The on-call engineer should not have to bounce between PagerDuty, Slack, GitHub, cloud dashboards, and three monitoring tabs just to answer the basic question of "what changed, what is broken, and who owns it?"

That's the problem we're focused on at Strake. Not replacing an SRE team, but giving smaller teams enough context to respond faster, understand what is failing, and stop relearning the same incident twice.

Strake is in beta and it's free to try. If you're a small team managing incidents with Slack threads and tribal knowledge, come take a look.


Rob is building Strake — an operational platform for startup founders that connects your tools, surfaces what needs your attention, and cuts the overhead of running a company before it buries you. Less time managing operations. More time building the thing.

If that's the problem you're living with, follow along or reach out on X.

