More than 100 million Americans traveled during the December 2022 holiday season. For those booked on one of the 15,000 flights that Southwest Airlines (SWA) canceled, it was a miserable time. It appears that severe weather overwhelmed old technology and created a cascading disaster that took days to fix. Now comes the aftermath: repairing the substantial damage SWA did to its brand.
If you’re a CXO or board director, you’re probably asking, “Could this happen to us?” I’m afraid the answer is “Yes.” Consider SWA’s public shaming as a wake-up call to review your technology oversight. In this analysis, we’ll explore four areas: capacity risk, project risk, disaster risk, and technical debt, all of which need to be reviewed to prevent your business from undergoing its own, very public meltdown.
Does your industry have busy times? Maybe it’s holiday travel; midsummer electricity surges; periodic peaks (like “late payment day,” the 10th of the month in the mortgage business); or special events (World Cup is coming to Dallas). Are your systems and processes sized to handle these busy times?
What if you’re at a peak time and the flu has 10% of your team out sick? Or a freak weather event adds complexity to your operations? What if you’re in the middle of an acquisition or major system upgrade when a peak event hits?
IT Oversight Tip #1: Understand industry peaks, and stress test systems/processes to handle expectations — plus “headroom.”
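To make the “headroom” idea concrete, here is a minimal back-of-the-envelope sketch. The formula, the function names, and the sample figures (10,000 orders per hour, 25% headroom) are illustrative assumptions rather than anything taken from SWA or a specific client; only the 10% absence rate echoes the flu scenario above. The calculation asks how much capacity you would need on paper so that peak demand plus headroom is still covered when part of your capacity is unavailable.

```python
def required_capacity(peak_load: float, headroom: float, absence_rate: float = 0.0) -> float:
    """Capacity to provision so that peak load plus headroom is still covered
    when a fraction of capacity (staff, servers, agents) is unavailable.

    peak_load    -- expected demand at the busiest time, in any consistent unit
    headroom     -- extra margin as a fraction of peak (0.25 means 25%)
    absence_rate -- fraction of capacity assumed to be out of action (0.10 means 10%)
    """
    if peak_load < 0 or headroom < 0:
        raise ValueError("peak_load and headroom must be non-negative")
    if not 0 <= absence_rate < 1:
        raise ValueError("absence_rate must be in the range [0, 1)")
    return peak_load * (1 + headroom) / (1 - absence_rate)


def has_headroom(installed: float, peak_load: float, headroom: float,
                 absence_rate: float = 0.0) -> bool:
    """True if installed capacity meets the requirement computed above."""
    return installed >= required_capacity(peak_load, headroom, absence_rate)


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 orders/hour at peak, 25% headroom, 10% of the team out sick.
    needed = required_capacity(10_000, headroom=0.25, absence_rate=0.10)
    print(f"Capacity needed: {needed:,.0f} orders/hour")
    print("Enough with 13,000 installed?", has_headroom(13_000, 10_000, 0.25, 0.10))
```

A calculation like this is only a starting point for a real stress test, but it gives the board a number to ask about: is installed capacity above or below the requirement under a bad-day assumption?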
Can your organization handle major system/process upgrades? It’s relatively easy to plan and budget for a technology system installation/upgrade: If it’s a purchased technology product, the vendor will hand information technology (IT) a “sample project plan.” But guess what? The “techie stuff” most IT departments focus on represents less than half the work/time/cost/risk of a true “get everyone to use the new processes/technology tools effectively (and stop using the old way)” project.
Project implementation failures run the gamut from “merely” expensive (e.g., the ERP that costs double and takes twice as long as budgeted) to existential (e.g., the FoxMeyer Drug bankruptcy).
Why do these failures happen? I could write 10,000 words cataloging all the ways my clients have failed. At a high level, what I’ve mostly seen across 30 years and numerous industries is that IT departments are under pressure to deliver solutions cheaply and that business units, also under pressure, abdicate responsibility for business process change to IT.
IT Oversight Tip #2: Ensure that major technology projects are scoped and resourced as organizational change management projects rather than as technology projects, and that adequate contingency costs are part of the budget. Assign a senior executive or experienced consultant as project owner and incentivize the C-suite to remain engaged with the process and the outcome.
Every mid-size and larger organization has a “disaster plan” that purports to help the organization recover from a range of calamities. Reality check: Most such plans exist in dusty binders designed to satisfy auditors and won’t work when needed.
The four main problems I’ve seen are:
- Lack of imagination when defining “disaster” types or scope
Everyone knows about fire, flood, power, or network outage. What about a railcar full of chlorine derailing next to a critical business location? How about 10,000 gallons of sugar syrup cascading down from the roof? (True story: My dad ran a candy factory that stored liquid sugar in a roof tank. When the supplier’s pump truck overfilled the tank, highly corrosive syrup ran down stairwells and inside walls rendering the factory unusable for two years while it was gutted and rebuilt.) What about scope? One facility is easy to plan for. What about a regional disaster (hurricane/earthquake/snowstorm) that affects your organization along with employees’ families plus your regional supply chain? Or a global pandemic, for that matter?
- Viewing disaster planning as an IT problem
Disruptions to operations are an organizational problem, not a departmental or technology problem. When a hurricane hits, and employees are trying to get families to safety, they’re not coming to work!
- Technology stack changes
Suppose you maintain a “disaster site” (an outdated notion). You’ll discover that starting it up when a disaster hits (in the middle of the night on a holiday weekend while a storm is raging) will not go smoothly because parts will be missing or broken, and changes to your “production” environment aren’t mirrored at your disaster recovery (DR) site.
Even organizations that are serious about DR often forget that moving back to the primary site(s) when the disaster ends can be quite complex, with the need to get data synchronized and systems running normally.
IT Oversight Tip #3: Move from a “disaster planning” mindset to a “continuous distributed availability” mindset, in which any local or regional disaster shifts work from affected sites to other sites in unaffected areas. If you can’t adopt that approach, stress test your disaster plans using business outcomes (e.g., process customer orders) rather than IT goals, such as “Get server ‘x’ running.”
The SWA story is full of references to “technical debt,” which I recently discussed in-depth with Bob Evans. Let’s define it: Technical debt is the sum total of maintenance to IT components (hardware, software, network) that should have been performed but wasn’t. Like any complex machine — a car or a car factory — your IT “stack” of hardware and software needs maintenance. Physical devices break or wear out; wires fray; trading partners change their systems in ways that must be accommodated; hackers discover weaknesses that require patching; vendors update their software to fix bugs and add features.
It can be tempting to defer IT maintenance to deliver new capabilities, which generate goodwill and bonuses for the IT team. After all, nobody gets praised for updating stuff that seems to be working — especially when some of those updates cause problems for employees and customers.
But, while it’s OK to manage the IT maintenance project backlog to minimize disruptions and favor cool new features, deferring maintenance for months or years has dire consequences, as the world recently witnessed with Southwest Airlines.
Mounting technical debt weakens the IT stack slowly, but the weakness tends to surface suddenly, often when the organization is stressed by added volume (see “Capacity Risk” above), by new systems being implemented (see “Project Risk”), or by external pressures (see “Disaster Risk”); in other words, when the organization is most vulnerable and resources are stretched to the limit.
The biggest issue with technical debt is that it’s usually undocumented and unmanaged: IT rarely keeps a list of maintenance that isn’t being done; neither GAAP (generally accepted accounting principles) nor IFRS (International Financial Reporting Standards) requires any documentation; and business leaders haven’t been trained to ask about it. Nobody is incentivized to fix it. I think of technical debt as an “off-balance-sheet liability” that will need to be repaid someday.
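The missing “list of maintenance that isn’t being done” does not need to be elaborate. As a minimal sketch, assuming a simple register in which each deferred item carries a rough remediation estimate, something like the following would already let a CIO report a total and flag the oldest items. The field names, the one-year threshold, and the sample entries are hypothetical, not a standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DebtItem:
    """One piece of deferred maintenance (all fields are illustrative)."""
    component: str         # system or asset the work applies to
    description: str       # what should have been done
    deferred_since: date   # when the work was first postponed
    estimated_cost: float  # rough cost to remediate, in the budget's currency


def total_debt(items: list[DebtItem]) -> float:
    """Sum of estimated remediation costs: the 'off-balance-sheet liability'."""
    return sum(item.estimated_cost for item in items)


def overdue(items: list[DebtItem], as_of: date, max_age_days: int = 365) -> list[DebtItem]:
    """Items that have been deferred for longer than max_age_days."""
    return [item for item in items if (as_of - item.deferred_since).days > max_age_days]


if __name__ == "__main__":
    # Hypothetical register entries.
    register = [
        DebtItem("billing-db", "Upgrade unsupported database version", date(2020, 1, 15), 120_000.0),
        DebtItem("edge-firewall", "Apply pending firmware patches", date(2022, 11, 1), 15_000.0),
    ]
    print(f"Total technical debt: {total_debt(register):,.0f}")
    for item in overdue(register, as_of=date(2022, 12, 31)):
        print(f"Overdue: {item.component} (deferred since {item.deferred_since})")
```

Even a spreadsheet with these four columns would serve the same purpose; the point is that the liability gets written down and reviewed.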
IT Oversight Tip #4: Business executives must consider technical debt when making business decisions, including mergers-and-acquisitions decisions. Projects must be budgeted based on “total lifecycle costs” rather than up-front implementation costs alone (CFOs, I’m looking at you). CIOs must be brave enough to resist pressure to “do more with less” when budgeting projects, especially when digging out of a technical debt hole (What’s the first step when filling in a hole? Stop digging!).
And someone with the right connections, please convince FASB (Financial Accounting Standards Board) and PCAOB (Public Company Accounting Oversight Board) to put technical debt on firms’ balance sheets so investors can gain a clearer picture of the ever-increasing risk firms face as automation becomes more pervasive!
What does all this mean to an organization’s top management and governance executives?
- Understand all your technology risks, not just cybersecurity
- Understand your technical debt, and how it aligns with your risk tolerance
- Recruit a qualified technology expert (QTE) director to the board to strengthen oversight of technology risks and opportunities