There is a specific kind of professional experience that doesn’t appear on many CVs but that most long-tenure support leaders have lived through: the period where the product is unstable, the outages are chronic, and your job becomes less about resolving individual incidents and more about managing a sustained state of organisational distress — for your customers, for your team, and for yourself.
I went through a version of this early in my career. Two years at an organisation where a mix of services was in a near-constant state of degradation. No single customer experienced the full two years — the specific failures rotated across products and regions — but as the support leader, I carried the cumulative weight of it. I was flown across the country and around the world to sit across from customers whose businesses were being impacted by services we had promised would work. I got very good at delivering bad news, maintaining relationships under pressure, and pushing internally for fixes that were perpetually just around the corner.
What follows is what I learned. Not the polished version — the real one, including what I got wrong.
The first thing to understand: chronic outages are a different problem from acute outages.
A single major incident — a P1 that takes down a critical service for four hours — is a crisis. It’s acute, it’s bounded, and the playbook for managing it is well-established. The incident bridge, the status page updates, the RCA document, the client communication — all of that exists to manage a defined event with a beginning and an end.
Chronic instability is different in kind, not just degree. When customers have experienced three outages in six weeks, the fourth outage lands differently. The client who was patient after the first incident, understanding after the second, and frustrated after the third has, by the fourth, begun making a calculation — not about this outage specifically, but about whether the pattern they’re observing reflects the company they chose to trust with their operations. That calculation doesn’t reset when the fourth incident is resolved. It accumulates.
Managing customer relationships through chronic instability requires understanding that you’re not managing individual incidents anymore. You’re managing a narrative. And the narrative customers construct from a series of incidents is heavily influenced by what happened between the incidents, not just during them.
Communication between outages is more important than communication during them.
Most support organisations have a reasonable protocol for communicating during an active outage. Status page updates, bridge calls, regular client emails, executive escalation paths. These matter and they need to be done well. But in a period of chronic instability, the communication that most determines whether you retain customers is the communication that happens when everything is working.
When an outage resolves and the immediate pressure releases, the instinct is to move on. The queue needs attention, the team is tired, and revisiting the incident feels like dwelling. This is exactly backwards. The period immediately following an incident — when the client is still processing what happened but is no longer in active distress — is the highest-value window for relationship repair.
What that communication looks like in practice: a direct call from a named relationship owner within 24 hours of resolution, not to apologise again but to check in. A written RCA within 48 hours that demonstrates you’ve understood the root cause and have a specific plan to prevent recurrence — not a template, a genuine account of what failed and what you’re doing about it. A follow-up call at two weeks to confirm the corrective actions are in progress.
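For teams that track follow-ups in tooling rather than from memory, the sequence above can be sketched as a small schedule generator. This is a minimal illustration, not a prescribed implementation: the `FollowUp` type and `follow_up_plan` function are hypothetical names, and the only inputs taken from the text are the 24-hour, 48-hour, and two-week intervals.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FollowUp:
    """One post-incident commitment and the time it falls due (hypothetical type)."""
    action: str
    due: datetime


def follow_up_plan(resolved_at: datetime) -> list[FollowUp]:
    """Build the three post-incident touchpoints, measured from resolution time."""
    return [
        FollowUp("Check-in call from the named relationship owner",
                 resolved_at + timedelta(hours=24)),
        FollowUp("Written RCA with root cause and prevention plan",
                 resolved_at + timedelta(hours=48)),
        FollowUp("Follow-up call to confirm corrective actions are in progress",
                 resolved_at + timedelta(weeks=2)),
    ]


if __name__ == "__main__":
    # Example: an incident resolved on a Monday morning.
    for item in follow_up_plan(datetime(2024, 1, 1, 9, 0)):
        print(item.due.isoformat(), "-", item.action)
```

The point of generating the dates at resolution time is that the commitments exist before the post-incident fatigue sets in, so nothing depends on someone remembering to schedule them later.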
Clients who receive that sequence — who feel that the vendor takes each incident seriously and follows through on commitments made after it — have a qualitatively different experience of the same outage history than clients who receive an apology email and then silence. The research on customer loyalty in service contexts is consistent on this point: what drives retention after a service failure is not the quality of the initial response. It’s the perceived effort of the follow-through. The communication between outages is the follow-through.
Internal advocacy is the part nobody talks about.
Managing customer relationships through chronic instability requires something that most support leaders underestimate: the sustained internal effort of keeping engineering and product leadership continuously aware of the customer impact of the instability they’re managing.
Technical teams working on reliability problems are usually aware that outages are occurring. What they’re often not aware of — not viscerally, not in a way that affects their prioritisation — is what those outages cost on the customer side. Not the SLA penalty or the contractual exposure. The actual human cost: the customer whose product launch was delayed because the platform went down during setup, the enterprise client whose board presentation was disrupted because the reporting tool failed that morning, the customer who has started preparing migration documentation because they’ve decided internally that this vendor isn’t sustainable.
Your job, in a sustained period of instability, is to make that customer reality present in the rooms where engineering decisions are made. Not as a complaint — complaints produce defensiveness — but as data. Customer impact translated into business terms: this many enterprise accounts have now experienced more than three incidents, this is the renewal risk concentration, this is the contractual exposure if SLA credits are claimed at the current incident rate. The language of customer relationships doesn’t always land in technical conversations. The language of business risk usually does.
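As a rough illustration of what that translation can look like, the sketch below turns a list of accounts into the three figures named above. Everything in it is an assumption made for the example: the `Account` fields, the flat per-incident SLA credit, and the idea of measuring renewal risk concentration as the share of contract value held by accounts over the incident threshold. Real contracts and risk models will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    """Hypothetical account record; field names are illustrative only."""
    name: str
    annual_value: float             # annual contract value
    incidents: int                  # incidents experienced in the review period
    sla_credit_per_incident: float  # assumed flat credit owed per incident


def impact_summary(accounts: list[Account], threshold: int = 3) -> dict[str, float]:
    """Summarise customer impact as business risk figures."""
    # Accounts that have experienced MORE than `threshold` incidents.
    over = [a for a in accounts if a.incidents > threshold]
    total_value = sum(a.annual_value for a in accounts)
    value_at_risk = sum(a.annual_value for a in over)
    return {
        "accounts_over_threshold": len(over),
        # Share of total contract value sitting in the over-threshold accounts.
        "renewal_risk_share": value_at_risk / total_value if total_value else 0.0,
        # Exposure if every account claimed its credits for every incident.
        "sla_credit_exposure": sum(
            a.incidents * a.sla_credit_per_incident for a in accounts
        ),
    }


if __name__ == "__main__":
    sample = [
        Account("A", 100_000, 4, 1_000),
        Account("B", 300_000, 1, 2_000),
        Account("C", 100_000, 5, 500),
    ]
    print(impact_summary(sample))
```

Three numbers on one slide are harder to wave away than a list of unhappy customers, which is the reason for reducing the picture this far.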
This is the case for Problem Management as a discipline distinct from Incident Management: the structured process of identifying the systemic causes of recurring incidents and tracking the corrective actions through to completion. Without a formal Problem Management process, engineering teams are perpetually triaging the next incident rather than addressing the conditions that produce it. With one, the pattern of recurring failures becomes visible and trackable, and the accountability for corrective action sits with named owners rather than disappearing into the general backlog.
What chronic outages do to your team — and what to do about it.
This is the dimension of sustained instability that gets the least attention in the operational literature and the most attention in the careers of support leaders who’ve lived through it.
Support agents handling chronic outages are doing something qualitatively different from agents handling normal incident volume. They’re not just resolving tickets. They’re absorbing client frustration that isn’t about them personally but that lands on them personally. They’re managing relationships on behalf of an organisation that keeps failing those relationships, through no fault of the agents themselves. They’re explaining the same situation, again, to a client who has heard the same explanation before and is increasingly disinclined to accept it.
This is the structural mechanism behind burnout in support teams. The emotional labour of sustained customer-facing distress, compounded by the helplessness of seeing the same problems recur, is one of the most reliable routes to exhaustion. And unlike a single high-intensity incident, chronic instability doesn’t have a clear end point that agents can orient themselves toward. There’s no moment when the crisis is resolved and the team can exhale. There’s just the next week.
What works at the leadership level during these periods: radical transparency about what you know and don’t know. Agents who understand the root cause of the instability — even if that understanding is “we have a systemic reliability problem that engineering is working to address over the next quarter” — handle client conversations better than agents who are as surprised by each outage as the customers are. It’s also more honest, and clients tend to respect honesty about a hard situation more than they respect a carefully managed communications line that feels evasive.
Protect your best people’s capacity. In a sustained crisis, the instinct is to put your strongest agents on the hardest conversations — which concentrates the emotional burden on exactly the people you can least afford to burn out. Rotate deliberately. Build the exposure to difficult client conversations into the schedule as a managed allocation rather than letting it accumulate on whoever is willing to take it.
And be honest with your team about the organisational situation. The agents on your team are adults making career decisions. If the instability is temporary — a product going through a difficult migration, a scaling challenge that has a defined resolution plan — say so. If it’s unclear, say that. The information vacuum that forms when leadership says nothing gets filled with rumour, and rumour is almost always worse than the truth.
When to escalate, and when to stop absorbing on your company’s behalf.
There’s a version of the support leader’s role during chronic instability that is quietly unsustainable: the person who is so focused on retaining customer relationships that they are absorbing, on behalf of the organisation, consequences that the organisation should be feeling directly.
Support leaders are very good at protecting their organisations from the full weight of their own reliability failures. We communicate well, we manage relationships, we apply SLA credits and we follow up diligently, and the result is that customers stay — sometimes past the point where the rational calculation of staying versus leaving should have produced a different answer. This is often framed as success. It is sometimes a sign that leadership above you has become comfortable with a level of instability that they shouldn’t be, because the consequences are being managed away by the support team rather than being felt by the people who can actually fix the underlying problem.
Escalating clearly — presenting the business risk in terms that leadership can’t manage away — is part of the job. The data from your incident management process, the renewal risk concentration, the volume of client escalations that have reached executive level — these are numbers that belong in front of the people who control the engineering investment decisions. If you’re in a position where you’re regularly delivering that data and it’s not changing the investment priority, that’s a different problem. But the first step is making sure the data is reaching the right people in the right language.
The period I described at the beginning of this post ended not because I managed customer relationships perfectly, though I worked very hard at that. It ended because the organisation eventually made the reliability investments it had been deferring. Two years is a long time to hold a customer base together through sustained instability. The support function did its job. The long-term fix required the engineering and product functions to do theirs.
Hutch Morzaria is a CX and Support Leadership professional with 19 years of experience building and leading support organizations across SaaS, Fintech, and enterprise technology. He has held Director-level roles at Q4 Inc, AudienceView, Johnson Controls, and others, and holds ITIL Expert certification across V3 and V4.

