You're Migrating Off Opsgenie. Here's What You Should Actually Fix.
Opsgenie's end-of-support is April 2027. If you're on a small engineering team, you're probably mid-migration right now — comparing PagerDuty pricing tiers, reading incident.io vs. BetterStack threads, maybe resigning yourself to Jira Service Management because you're already deep in the Atlassian ecosystem.
I want to suggest something uncomfortable before you pick your next tool: alerting was never your actual problem.
I managed Opsgenie rotations at three different companies over the past eight years. FreightWaves, TextNow, Pilot Flying J. Different industries, different stacks, different team sizes. The pattern was always the same.
Someone would deploy a change. Something would break. Opsgenie would page the on-call engineer. That engineer would open a Notion doc titled "Runbook — Service X" that hadn't been updated since 2022. They'd mostly ignore it and Slack the person who wrote the service. That person would fix it. Everyone would move on. Two weeks later, something similar would happen again.
Opsgenie did its job perfectly. It routed the alert to the right person. The problem was everything else.
The question nobody was asking
At none of those companies — not one — did anyone ask the obvious question before deploying: is it safe to push right now?
Not "did CI pass." Not "did someone approve the PR." I mean: is the system healthy enough to absorb a change right now? Are we burning through error budget? Is there already an active incident? Did someone just deploy 20 minutes ago and we haven't seen the impact yet?
Nobody asked because there was no way to answer it. The information existed — scattered across Datadog, PagerDuty, GitHub, Slack — but nobody had assembled it into a single decision. So engineers deployed based on gut feel. "Seems fine." "I don't see anything in Slack." "The dashboards look okay I guess."
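To make that concrete, here's a minimal sketch of what collapsing those scattered signals into a single decision could look like. Everything in it is hypothetical: DeploySignals and is_safe_to_deploy are stand-in names, the three inputs would come from your own integrations (an SLO API, an incidents API, your deploy log), and the thresholds are illustrative, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class DeploySignals:
    # All three values are assumed to come from your own integrations
    # (e.g. your SLO dashboard, your paging tool's API, your deploy log).
    error_budget_remaining: float   # fraction of SLO error budget left, 0.0-1.0
    active_incidents: int           # currently open incidents for this service
    minutes_since_last_deploy: int  # time since the previous production deploy

def is_safe_to_deploy(s: DeploySignals) -> tuple[bool, str]:
    """Collapse scattered health signals into one go/no-go answer.

    Thresholds here are illustrative; tune them to your own SLOs
    and deploy cadence.
    """
    if s.active_incidents > 0:
        return False, f"{s.active_incidents} active incident(s); resolve those first"
    if s.error_budget_remaining < 0.25:
        return False, f"only {s.error_budget_remaining:.0%} of error budget left"
    if s.minutes_since_last_deploy < 30:
        return False, "last deploy hasn't soaked yet; wait out its impact window"
    return True, "no blocking signals"

safe, reason = is_safe_to_deploy(
    DeploySignals(error_budget_remaining=0.6, active_incidents=0,
                  minutes_since_last_deploy=45)
)
print(("SAFE" if safe else "HOLD") + ": " + reason)
```

The point isn't the thresholds. The point is that the answer exists at all, in one place, before anyone pushes.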
43% of incidents are preceded by a recent deploy. That number didn't surprise me at all when I first saw it. It matched what I'd lived through.
The runbook problem is worse than you think
Here's the part of the Opsgenie migration conversation that nobody is having: most teams using Opsgenie didn't just use it for alerting. It was their entire incident process. An alert comes in, Opsgenie pages someone, that person figures it out. There's no structure beyond that.
The runbooks — if they existed — lived in Confluence or Notion. I wrote about this in incident management without a dedicated SRE, and the core problem hasn't changed: a runbook that's three clicks away from the alert that triggered it is a runbook that doesn't get opened at 3am.
I've seen this enough times to have a visceral reaction to it. The on-call engineer gets paged, opens Slack, asks "has anyone seen this before?" and waits. Meanwhile the customer is staring at a broken login page. The runbook that would have told them to check the config deployment and roll back the last change is sitting in a Confluence space that the engineer didn't even know existed.
Teams that connect their runbooks directly to their alerts — so the runbook opens automatically when the relevant alert fires — cut their mean time to resolution from 67 minutes to 23. That's not a marginal improvement. That's the difference between an incident that costs you a customer and one that costs you 20 minutes.
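The simplest version of that connection is almost embarrassingly small. Here's a sketch, assuming a generic alert webhook payload (no specific vendor's shape) and a Slack incoming webhook for delivery; the RUNBOOKS mapping and on_alert handler are hypothetical names, though the Slack JSON body format is real.

```python
import requests

# Hypothetical mapping from alert name to runbook URL. The key move is that
# this lives in code next to your alerts, not in a wiki nobody opens at 3am.
RUNBOOKS = {
    "auth-service-5xx-spike": "https://runbooks.example.com/auth-service/5xx",
    "payments-queue-backlog": "https://runbooks.example.com/payments/queue-backlog",
}

def on_alert(alert: dict, slack_webhook_url: str) -> None:
    """Handle an incoming alert webhook (payload shape assumed, not any
    particular vendor's) and push the matching runbook into the page itself."""
    name = alert.get("alert_name", "unknown")
    runbook = RUNBOOKS.get(name)
    text = f":rotating_light: *{name}* fired."
    if runbook:
        text += f"\nRunbook: {runbook}"
    else:
        text += "\nNo runbook linked. That is itself an action item."
    # Slack incoming webhooks accept a simple JSON body with a `text` field.
    requests.post(slack_webhook_url, json={"text": text}, timeout=5)
```

Ten minutes of glue code, and the on-call engineer never has to know which Confluence space the runbook lives in.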
What "migrating off Opsgenie" should actually mean
If you're going to rip out a core piece of your incident workflow, this is the moment to ask harder questions than "which alerting tool has the best Slack integration."
The questions I'd ask:
Do you know whether it's safe to deploy right now? Not in a gut-feel way. In a "here's your error budget status, here's your active incident count, here's your deploy velocity over the last 24 hours" way. If you don't have that, you're going to keep causing the incidents your shiny new alerting tool routes to your team.
When someone gets paged, do they know what to do? Not "figure it out." Actually know — because the runbook showed up in front of them automatically, with the steps they need and the context about what changed. If your runbooks are still in a wiki, your migration isn't going to fix the thing that actually hurts.
Are you learning anything from your incidents? Not in a blameless-postmortem-Google-Doc way. I mean: does your system know that this service broke last Tuesday after a similar deploy? Does your deploy process incorporate the history of what's gone wrong before? Most teams I've worked with have zero institutional memory. Every incident is treated as a surprise, even when it's the third time it's happened.
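That third question is the most fixable one. Even a crude incident log, queried at deploy time, beats zero memory. Here's a minimal sketch, with INCIDENT_LOG as a stand-in for whatever your incident tool's API or postmortem database would actually return:

```python
from datetime import datetime, timedelta

# Hypothetical incident log: in practice this comes from your incident
# tool's API or a table your postmortems write into.
INCIDENT_LOG = [
    {"service": "auth-service", "started": datetime(2025, 6, 3, 14, 10),
     "cause": "config deploy", "summary": "login 500s after config push"},
    {"service": "auth-service", "started": datetime(2025, 6, 17, 9, 42),
     "cause": "config deploy", "summary": "login 500s after config push"},
]

def prior_incidents(service: str, window_days: int = 90) -> list[dict]:
    """Return recent incidents for a service so a deploy can surface them.

    The point is not the query. The point is that the answer shows up
    *before* the deploy instead of in a postmortem doc afterward.
    """
    cutoff = datetime.now() - timedelta(days=window_days)
    return [i for i in INCIDENT_LOG
            if i["service"] == service and i["started"] >= cutoff]

for inc in prior_incidents("auth-service"):
    print(f"{inc['started']:%Y-%m-%d}  {inc['cause']}: {inc['summary']}")
```

If the engineer about to push a config change to auth-service saw those two lines first, the third config incident probably doesn't happen.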
Don't just swap your paging tool
The Opsgenie shutdown is a forcing function. Use it.
If you just swap Opsgenie for PagerDuty or BetterStack, you'll have the same problem in a different UI. Engineers deploying blind. Runbooks gathering dust. Your on-call rotation burning people out because every incident starts from scratch. I wrote about the monitoring version of this trap in your startup doesn't need better monitoring — the tooling isn't the bottleneck. The process is.
The actual fix is a layer that sits before your alerting tool. A deploy gate that tells your team whether it's safe to push. Connected runbooks that show up when things break. Incident data that compounds into institutional knowledge so your team stops relearning the same failure every quarter.
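To sketch what the deploy-gate half looks like in practice: a single CI step that runs before the deploy job and fails the pipeline when the signals say hold. This is a self-contained toy, not any tool's real interface; fetch_signals is stubbed where your real integrations would go, and the check abbreviates the earlier sketch.

```python
import sys
from types import SimpleNamespace

def fetch_signals():
    # Stubbed for the sketch; replace with real calls to your monitoring,
    # paging, and deploy-history APIs.
    return SimpleNamespace(error_budget_remaining=0.1,
                           active_incidents=0,
                           minutes_since_last_deploy=120)

def is_safe_to_deploy(s):
    # Same logic as the earlier sketch, abbreviated.
    if s.active_incidents > 0:
        return False, "active incident"
    if s.error_budget_remaining < 0.25:
        return False, "error budget nearly spent"
    if s.minutes_since_last_deploy < 30:
        return False, "previous deploy still soaking"
    return True, "no blocking signals"

if __name__ == "__main__":
    safe, reason = is_safe_to_deploy(fetch_signals())
    print(("proceeding" if safe else "blocking deploy") + ": " + reason)
    sys.exit(0 if safe else 1)  # nonzero exit fails the CI step
```

With the stubbed numbers above, the step exits nonzero and the deploy waits. That's the whole idea: the pipeline asks the question nobody was asking.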
That's what I'm building at Strake. It's in private beta right now and it works alongside whatever alerting tool you pick — PagerDuty, BetterStack, Grafana OnCall, whatever. The deploy gate and the runbook layer are the parts that were always missing, regardless of who was routing the page.
If you're mid-migration and want to talk through how your team handles deploy safety, I'm happy to jump on a call. Not a pitch — I'm genuinely trying to learn from teams going through this right now.
Rob is building Strake — a deploy gate and incident workflow platform for engineering teams without dedicated SRE coverage. If your current incident process is "someone posts in Slack and we figure it out from there," come take a look at strake.dev.