When the Grid Goes Dark

It started over a beer

A friend of mine owns an electrical contracting company. Smart guy, runs a tight crew, good business. We were sitting around talking shop and he started walking me through what emergency dispatch actually looks like for him - generator failures, storm callouts, middle-of-the-night outages. The whole operation runs on gut feel, phone calls, and whoever picks up. He knows from experience which of his guys is fastest on transformer work. He knows which neighborhoods have the worst access during rain. It all lives in his head.

About halfway through the conversation, we both said the same thing at the same time: if this is what it looks like for a small contractor with a handful of crews, what does it look like for a utility company managing 30 simultaneous outages in the middle of a February ice storm?

We already knew the answer. It looks like the same thing, just with more zeros, more people, higher stakes, and the same institutional knowledge locked in someone's head - except that person is now running a storm bridge call with 40 supervisors asking for updates.

The actual problem: Electrical utilities have decades of outage data, thousands of sensors, and experienced field crews. What most of them don't have is a systematic way to use any of that before a storm hits. They respond. They don't predict. And the difference between those two modes, when you're talking about 8-hour restoration windows and $150 billion in annual US economic losses from power outages, is enormous.

I've spent 10+ years designing and operating large-scale systems across industries. The failure patterns in a stressed electrical grid and a stressed software platform are nearly identical. Same root causes, same organizational behavior under pressure, same failure to use data that's already sitting there. Good architecture fixes both. The instinct is the same whether you're designing a distributed system or a dispatch model for 400 field crews.

Failure patterns: they're not random

The first thing any good engineering team does after a bad incident is ask: has this happened before? Almost always the answer is yes. Grid failures work the same way: the same 15 to 20 percent of assets account for 70 to 80 percent of storm-related outages, every single year. That's not bad luck. It's a data problem masquerading as an operations problem.

Pull five years of outage management system (OMS) data, join it with NOAA storm track records, and map failure frequency by geographic segment and equipment type. You'll have your Pareto list within a week. This isn't exotic analysis. It's the same pattern matching any architect uses when profiling a system for bottlenecks.
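Here's a minimal sketch of that Pareto pass, assuming the OMS export has already been reduced to (feeder ID, customer-minutes interrupted) pairs. The feeder IDs and numbers are invented for illustration, and the join against NOAA storm tracks is left out.

```python
from collections import defaultdict

def pareto_assets(outages, target_share=0.8):
    """Return the smallest set of assets, worst first, that together
    account for at least `target_share` of total customer outage minutes."""
    totals = defaultdict(float)
    for asset_id, customer_minutes in outages:
        totals[asset_id] += customer_minutes

    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

    selected, running = [], 0.0
    for asset_id, minutes in ranked:
        if running >= target_share * grand_total:
            break  # the assets already selected cover the target share
        selected.append(asset_id)
        running += minutes
    return selected

# Hypothetical records: (feeder_id, customer_minutes_interrupted)
sample = [
    ("F-101", 4000), ("F-101", 2000), ("F-202", 2500),
    ("F-303", 800), ("F-404", 400), ("F-505", 300),
]
worst = pareto_assets(sample)
```

In this toy data, two of the five feeders carry 85 percent of the outage minutes. Those two are the ones to instrument and pre-stage for first.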

The failures themselves cluster into three archetypes. Each one behaves differently, costs differently, and requires a different response.

Archetype 01 Cascade failure

One feeder trips and its downstream load redistributes instantly. The adjacent feeder tips over the edge within minutes. This is the one that hurts most: a single cascade during a storm event can account for 60 to 70 percent of total customer outage time.

Trigger: Transmission line fault, ice loading, voltage instability under redistributed load.

Archetype 02 Spot failure

An isolated transformer or distribution pole goes down. Clean blast radius, limited spread. These are the most common and the most manageable - if you have good crew routing and parts pre-staged nearby.

Trigger: Wind damage, tree contact, direct physical impact to the line or pole.

Archetype 03 Latent degradation

Equipment that was already failing on its own timeline finally gives out under storm stress. No visible external damage. No obvious cause. Just an aging transformer that picked the worst possible night to stop working.

Trigger: Equipment age past rated lifespan combined with thermal stress, not any single weather event.

Architect's parallel

Cascade failures in a grid look almost identical to cascading failures in distributed systems - one overloaded component starts dropping requests, others pile on, the whole system tips over. The fix in software is circuit breakers and load shedding. The fix in a grid is topology-aware switching and pre-planned load redistribution. Same concept, different medium. This is what thinking like an architect buys you - pattern recognition that crosses domains.

Prediction over reaction: the cost math

Most utilities today operate in fully reactive mode. They know something failed when a breaker trips or a customer calls. The shift to predictive operation takes some upfront investment in instrumentation. What it saves on the back end is significant. Here's a realistic look at how unpredicted failures cost out over a storm season, compared with failures on feeders that were flagged and pre-staged before the storm arrived.

Reactive: 8.5 hr average restore. Crew drives from depot, cold start, no pre-staged parts.

Pre-staged: 3.2 hr average restore. Crew 12 mi out, parts on truck, runbook in hand.

Delta: 5.3 hours of unnecessary outage per restore event, multiplied across hundreds of outages.

Season: Across 200 major events, ~1,060 hrs of avoidable downtime per storm season.

The math: 200 restore events. 5.3 hours of avoidable outage each. That's 1,060 hours of customer impact per storm season that better pre-positioning eliminates. The instrumentation pays for itself in the first event.
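The same arithmetic as a few lines of Python, so you can swap in your own restore times and event counts. The variable names are mine; the three input numbers are the ones quoted above.

```python
reactive_hours = 8.5      # average restore, crew dispatched cold from the depot
prestaged_hours = 3.2     # average restore, crew and parts staged nearby
events_per_season = 200   # major restore events in a storm season

# Hours of outage avoided on each restore, then summed over the season.
saved_per_event = round(reactive_hours - prestaged_hours, 1)
saved_per_season = round(saved_per_event * events_per_season)
```

This counts outage-hours per event. To get customer-hours, multiply each event by the number of customers it affected, a figure that will come from your own OMS data.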

What signals actually matter

Not all sensor data is equal. After mapping what actually predicts failures versus what just adds noise, the highest-signal inputs are these, in order of operational usefulness:

Feeder load vs. rated capacity

How close to the edge a circuit is before storm stress lands. A feeder running at 87% capacity has very little headroom when the next feeder trips and redistributes load onto it.

Lead time: Hours to days. Actionable before the storm arrives.
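A rough sketch of the headroom check, assuming you can estimate what fraction of a tripped feeder's load each neighbor picks up. Real redistribution depends on switching topology and power flow; the `share` fractions, feeder names, and megawatt values here are illustrative assumptions.

```python
def post_trip_utilization(feeders, tripped, share):
    """Utilization of each surviving feeder after `tripped` goes down.

    feeders: {feeder_id: (load_mw, rated_mw)}
    share:   {feeder_id: fraction of the tripped feeder's load it picks up}
    """
    lost_load, _ = feeders[tripped]
    result = {}
    for fid, (load, rated) in feeders.items():
        if fid == tripped:
            continue
        new_load = load + lost_load * share.get(fid, 0.0)
        result[fid] = new_load / rated
    return result

# Feeder B is already at 87 percent of rated capacity before anything trips.
feeders = {"A": (6.0, 10.0), "B": (8.7, 10.0), "C": (5.0, 10.0)}
after = post_trip_utilization(feeders, tripped="A", share={"B": 0.5, "C": 0.5})
overloaded = sorted(f for f, u in after.items() if u > 1.0)
```

With these numbers, feeder C absorbs its share and lands at 80 percent, while feeder B goes past its rating. B is the next link in the cascade.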

Transformer temp delta

Thermal runaway risk on aging equipment. A transformer running hotter than its baseline under the same load is telling you it's degrading. It's the equivalent of elevated p99 latency on a service that's about to fall over.

Lead time: Minutes to hours. Triggers a maintenance flag, not just a storm flag.
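One way to sketch the temperature-delta check: compare each reading to that transformer's historical temperature at a similar load. The 10-point load buckets, the 8 degree threshold, and the transformer IDs are placeholder assumptions, not recommended settings.

```python
def load_bucket(load_pct, width=10):
    """Round a load percentage down to the start of its bucket, e.g. 67 -> 60."""
    return int(load_pct // width) * width

def degrading_transformers(readings, baseline, threshold_c=8.0):
    """Flag transformers running hotter than their own baseline at the same load.

    readings: list of (transformer_id, load_pct, temp_c)
    baseline: {(transformer_id, load_bucket): expected_temp_c}
    """
    flagged = []
    for tid, load_pct, temp_c in readings:
        expected = baseline.get((tid, load_bucket(load_pct)))
        if expected is None:
            continue  # no history at this load level, nothing to compare against
        delta = temp_c - expected
        if delta > threshold_c:
            flagged.append((tid, round(delta, 1)))
    return flagged

baseline = {("T-1", 60): 70.0, ("T-2", 60): 68.0}
readings = [("T-1", 65, 72.5), ("T-2", 63, 81.0)]
flags = degrading_transformers(readings, baseline)
```

T-2 is running 13 degrees over its own baseline at the same load, so it gets the maintenance flag. T-1 is within tolerance.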

Smart meter cluster velocity

Meter pings going dark in sequence is the earliest real-time indicator of a cascade in progress. When you see 40 meters drop in the same 3-minute window along a feeder path, you know the direction of spread before the field crew does.

Lead time: Real-time, 2 to 5 minutes. Informs dispatch, not prevention.
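A minimal sliding-window detector for that signal, assuming meter drop events arrive sorted by time and each meter reports its loss once. The event shape and IDs are made up. The defaults mirror the 40 meters in 3 minutes example, and working out the direction of spread along the feeder path is left out.

```python
from collections import deque

def cascade_alerts(events, window_s=180, min_meters=40):
    """Return (feeder_id, timestamp) the first time each feeder loses at
    least `min_meters` meters within a `window_s`-second window.

    events: iterable of (timestamp_s, feeder_id, meter_id), sorted by timestamp
    """
    windows = {}     # feeder_id -> timestamps of recent drops
    alerted = set()  # feeders that have already raised an alert
    alerts = []
    for ts, feeder, _meter in events:
        win = windows.setdefault(feeder, deque())
        win.append(ts)
        while ts - win[0] > window_s:
            win.popleft()  # discard drops that fell out of the window
        if len(win) >= min_meters and feeder not in alerted:
            alerted.add(feeder)
            alerts.append((feeder, ts))
    return alerts

events = [(i * 4, "F-101", f"m{i}") for i in range(45)]    # one drop every 4 s
events += [(i * 60, "F-202", f"n{i}") for i in range(10)]  # slow, unrelated drops
events.sort()
alerts = cascade_alerts(events)
```

Only F-101 trips the alert, at the moment its 40th meter goes dark. The slow trickle on F-202 never fills a window.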

Wind + ice accumulation forecast

Combined with span length data, you can model line sag and contact risk for specific segments before the storm arrives. This is where weather model integration earns its keep.

Lead time: 12 to 72 hours. The primary input for pre-positioning decisions.

Crew dispatch: the gut-feel problem

This is the part of the conversation where my electrician friend and I really connected. His dispatch system is a group text and 15 years of knowing his guys. It works because his company has 8 crews and he has mental context on all of them. Scale that to a utility with 400 crews, mutual aid from three neighboring states, 300 active outages, road closures, and a storm still moving through, and that mental model collapses completely.

At its core, crew dispatch is a vehicle routing problem. You have crews (vehicles), outage locations (stops), restoration time estimates (service times), and priority tiers (weights). Logistics companies solve harder versions of this in real time. Utilities can too. The only missing ingredient is the will to instrument it properly.
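To make the framing concrete, here's a deliberately simple greedy heuristic: work through outages in priority order and give each one the nearest free crew. It is not an optimal vehicle routing solver, it uses straight-line distance rather than road travel time, and the crew IDs, outage IDs, and coordinates are all hypothetical.

```python
import math

def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))

def greedy_dispatch(crews, outages):
    """Assign each outage, most urgent tier first, to the nearest free crew.

    crews:   {crew_id: (lat, lon)}
    outages: list of (outage_id, priority_tier, (lat, lon)); tier 1 is most urgent
    Returns {outage_id: crew_id}. Outages left over when crews run out stay queued.
    """
    free = dict(crews)
    plan = {}
    for outage_id, _tier, loc in sorted(outages, key=lambda o: o[1]):
        if not free:
            break
        nearest = min(free, key=lambda c: haversine_miles(free[c], loc))
        plan[outage_id] = nearest
        del free[nearest]
    return plan

crews = {"crew-1": (40.00, -75.00), "crew-2": (40.30, -75.00)}
outages = [
    ("spot-transformer", 3, (40.01, -75.00)),
    ("hospital-feeder", 1, (40.02, -75.00)),
    ("substation", 2, (40.25, -75.00)),
]
plan = greedy_dispatch(crews, outages)
```

The spot transformer is physically closest to crew-1, but the hospital feeder outranks it and gets that crew. A production system would replace the greedy loop with a proper routing solver and real drive times.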

Priority tiers that actually hold up under pressure

Priority 1 - Life safety: dispatch first, no negotiation

Hospitals and dialysis centers
Assisted living facilities
Medical baseline customers
Traffic signal cluster outages
Water treatment and pumping

Priority 2 - High leverage: upstream fixes, stop the spread

Transmission faults affecting 1,000+ customers
Substation-level failures
Active cascade - isolate first
Commercial districts, economic exposure

Priority 3 - Batch efficiently: geographic clustering

Spot outages in same corridor
Standard transformer failures
Distribution pole damage, accessible roads

Priority 4 - Defer: post-event or contracted

Single-customer, no safety exposure
Equipment damage with no active outage
Non-storm backlog

The thing most people miss

Dispatch optimization isn't a one-time decision at the start of a storm event. It has to re-optimize continuously as new outages come in, roads close, and crews finish jobs. A plan locked at 7pm when the storm hits is stale by 9pm. The model should be rerunning every 15 to 20 minutes, the same way any dynamic routing system reweights based on current conditions. Static plans fail dynamic storms.

Crew fatigue state matters too. After 14 hours, error rates climb and safety risk goes up. A dispatch system that doesn't track time-on-task isn't really a dispatch system - it's a very expensive call log.
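Tracking time-on-task can start as a simple eligibility filter in front of the dispatcher. A sketch, assuming each crew has a recorded shift start; the 14-hour limit comes from the paragraph above, and the two-hour job estimate and crew IDs are assumptions.

```python
from datetime import datetime, timedelta

MAX_SHIFT = timedelta(hours=14)  # from the text; the real limit is a policy decision

def dispatchable(crews, now, est_job=timedelta(hours=2)):
    """Return crew ids that can finish a job of `est_job` before reaching
    MAX_SHIFT, least-fatigued first.

    crews: {crew_id: shift_start datetime}
    """
    eligible = [
        (now - start, crew_id)
        for crew_id, start in crews.items()
        if (now - start) + est_job <= MAX_SHIFT
    ]
    return [crew_id for _, crew_id in sorted(eligible)]

now = datetime(2024, 2, 10, 21, 0)
crews = {
    "crew-1": datetime(2024, 2, 10, 6, 0),   # 15 h on task
    "crew-2": datetime(2024, 2, 10, 12, 0),  # 9 h on task
    "crew-3": datetime(2024, 2, 10, 17, 0),  # 4 h on task
}
ready = dispatchable(crews, now)
```

Crew-1 drops out of the pool entirely, and the freshest crew is offered first.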

Pre-positioning: the highest-leverage move nobody does

If I had to pick one thing that would have the largest impact on utility storm response, this is it. Pre-positioning means staging crews, parts, and mobile command near predicted failure zones before the storm makes landfall. Not waiting for outages to happen and then driving across a storm-damaged region to respond.

72 hours out - run the model

Feed your historical failure frequency map into your weather forecast. Flag the top 20 percent of at-risk segments for the storm track. Brief mutual aid partners on expected call volume and crew type mix needed. At this stage you're not moving anything yet - you're building the picture and aligning on the plan.

Key input: NWS 72-hour probabilistic forecast track plus your Pareto failure map.
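One simple way to combine the two inputs is to multiply each segment's historical failure rate by its forecast exposure and keep the top fifth. The multiplicative score, the 0-to-1 exposure scale, and the segment IDs are my assumptions for illustration; a real model would be fitted against your own outage history.

```python
import math

def flag_at_risk(segments, top_fraction=0.2):
    """Rank segments by historical failure rate times forecast exposure
    and return the top `top_fraction` of them, highest risk first.

    segments: {segment_id: (failures_per_storm, forecast_exposure)}
              where exposure runs from 0 (off the storm track) to 1
    """
    ranked = sorted(
        segments,
        key=lambda s: segments[s][0] * segments[s][1],
        reverse=True,
    )
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    return ranked[:keep]

segments = {
    "seg-01": (4.0, 0.9), "seg-02": (6.0, 0.1), "seg-03": (1.0, 0.8),
    "seg-04": (3.0, 0.7), "seg-05": (0.5, 0.9), "seg-06": (2.0, 0.2),
    "seg-07": (0.2, 1.0), "seg-08": (1.5, 0.1), "seg-09": (0.1, 0.5),
    "seg-10": (5.0, 0.0),
}
flagged = flag_at_risk(segments)
```

Note that seg-02 and seg-10 have the worst failure history but sit mostly off this storm's track, so they don't make the list. That is the point of joining the two data sets.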

48 hours out - stage resources

Move mobile command units, transformer stock, and pole inventory to forward staging locations. Pre-position 30 to 40 percent of field crews within 15 miles of the predicted high-impact zone. This feels aggressive if you've never done it before. It feels obvious after the first time it saves you 4 hours per restore on 200 events.

Key decision: If NWS gives you 70 percent or higher probability of significant impact, pre-activate mutual aid. Don't wait for confirmation after the storm hits.

24 hours out - lock the plan

Confirm all road access routes - and identify alternates for the ones that typically flood or get debris-blocked in your region. Brief all crews on the priority tier system. Set the storm bridge call cadence. Make sure every dispatcher has a printed copy of the dispatch priority tiers, because systems go down during storms too.

Key risk: Crews staged without clear priority guidance will self-triage incorrectly under pressure. Runbooks matter here.

During the event - re-optimize continuously

Run your dispatch model on a 15 to 20 minute cycle. Track crew locations and time-on-task in real time. When a new high-priority outage comes in, re-evaluate whether the nearest available crew changes. Keep mutual aid crews paired with a local liaison who knows the road network - mutual aid crews are fast once deployed, but they lose time without local knowledge.

Anti-pattern: Holding all decisions until the storm bridge call. That's a 30 to 60 minute information lag per cycle. Too slow.

Post-event - run the postmortem

This is not optional. Not when you're tired, not when the next system is already forming in the Gulf. Schedule it before the season ends. What did the model fail to predict? Which dispatch decisions were wrong in retrospect? What parts ran out first? Where did road access assumptions fail? This learning loop is what turns a good storm response into a great one three years from now.

Blameless format only. You're looking for system failures, not human failures. Put someone else in that seat inside the same system and they'd make the same call.

The D.U.R.E.S.S. framework applied to the grid

I built the D.U.R.E.S.S. framework for distributed systems observability: Duration, Utilization, Rate, Errors, Saturation, System health. It maps to electrical grid monitoring almost perfectly - because a grid is a distributed system. Same topology, same failure modes, same need for a single pane of glass that answers "are we okay right now." This is what architect thinking looks like applied outside of software.

Dimension	In software	In the grid
Duration	Request latency, p99 response time	Mean time to restore per outage type and region. If your MTTR is climbing mid-storm, you have a dispatch or parts problem, not a crew problem.
Utilization	CPU, memory, disk percentage	Feeder load as a percentage of rated capacity. A feeder at 92 percent load has no headroom for redistribution when the next one trips.
Rate	Requests per second, throughput	Active outage creation rate versus restore rate. If outages are opening faster than they're closing, you are falling behind and need to call for more resources now, not in an hour.
Errors	HTTP 5xx rate, exception rate	Fault trip rate per segment, failed switching operations. A feeder tripping repeatedly after reset is telling you the fault isn't cleared - don't keep resetting it.
Saturation	Queue depth, connection pool exhaustion	Dispatch queue depth versus available crews. Transformer stock versus outstanding transformer jobs. When either of these hits zero, your restore rate flatlines.
System health	Synthetic checks, SLO burn rate	SAIDI and SAIFI against your seasonal budget. If you're burning SAIDI faster than your model predicted, escalate mutual aid now - not at end of quarter when the numbers look bad on a report.
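The Rate and Saturation rows translate almost directly into threshold checks. A sketch, assuming an hourly snapshot assembled from OMS and inventory data; the field names and alert wording are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class StormSnapshot:
    outages_opened_last_hour: int
    outages_restored_last_hour: int
    queued_jobs: int
    available_crews: int
    transformers_in_stock: int
    open_transformer_jobs: int

def duress_alerts(s):
    """Return the Rate and Saturation conditions that call for escalation."""
    alerts = []
    if s.outages_opened_last_hour > s.outages_restored_last_hour:
        alerts.append("rate: outages opening faster than restoring")
    if s.queued_jobs > 0 and s.available_crews == 0:
        alerts.append("saturation: jobs queued with no available crew")
    if s.open_transformer_jobs > s.transformers_in_stock:
        alerts.append("saturation: transformer stock below open jobs")
    return alerts

snapshot = StormSnapshot(
    outages_opened_last_hour=30, outages_restored_last_hour=18,
    queued_jobs=12, available_crews=0,
    transformers_in_stock=5, open_transformer_jobs=3,
)
alerts = duress_alerts(snapshot)
```

Here the storm is outrunning the restore rate and the crew queue is saturated, while transformer stock is still holding. That is a request for more crews, not more parts.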

What this actually takes to build

If a utility ops center built a real-time dashboard mapping these six dimensions - restore duration, feeder utilization, outage creation vs. restore rate, active fault rate, crew queue and parts saturation, and SAIDI burn - they would have more operational clarity during a storm than most of them have ever had. This is not exotic technology. It's a solid data pipeline from SCADA and OMS, sensible threshold definitions, and a team willing to own the dashboard. The hard part is organizational will, not the engineering.

Where to actually start

The answer is not "buy a new platform." The answer is: start with the data you already have and build the operational discipline first. Every utility already has OMS data, SCADA telemetry, and historical outage records. Most of it is sitting in a database nobody queries except for regulatory reporting.

01 Mine your outage history

Pull five years of OMS data, join it with weather records, and build your failure frequency map. You'll find your Pareto failure assets within a week. This costs almost nothing and tells you where every dollar of investment and every pre-positioned crew should go. Do this before you buy anything, build anything, or brief anyone.

What you get: A ranked list of the substations, feeders, and equipment segments responsible for the majority of your customer outage hours. This becomes the foundation for everything else.

02 Build one real runbook

Pick your most common storm failure type - probably spot transformer failure or feeder cascade depending on your region - and write a dispatch runbook. Not a 40-page PDF. A one-page decision tree: if this, then that. Who has authority to call mutual aid. What parts should be on the first truck. How to communicate restoration ETAs to the bridge call.

Test it: Run a 90-minute tabletop exercise before storm season with your dispatchers and field supervisors. The gaps in the runbook will show up immediately, in a room, not during a live P1 at 2am.

03 Instrument your top 20 percent

You don't need real-time telemetry on every asset in your territory. You need it on the assets that fail repeatedly. Take your Pareto list from step one and make sure every item on it has load monitoring, temperature sensing on transformers, and a health check that someone is actually watching. Targeted instrumentation beats broad instrumentation done poorly.

Threshold to set first: Load percentage at which a feeder is flagged as at-risk during a storm event. Start with 80 percent. Adjust based on your topology and historical cascade thresholds.

04 Pre-position once, measure everything

Pick the next predicted significant storm event and run the pre-positioning playbook manually if you have to. Stage two or three crews at forward locations based on your failure map and the weather forecast. Track every restore time, every drive time, every parts-on-truck hit and miss. The data from one event will justify the next level of investment more convincingly than any consultant deck ever could.

What to track: Crew travel time per restore, parts availability on first dispatch, and MTTR split by outage type. Compare to prior storms where you didn't pre-position.
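The comparison itself is a small group-by. A sketch, assuming restore records have been reduced to (outage type, restore hours) pairs; the sample values are invented.

```python
from collections import defaultdict
from statistics import mean

def mttr_by_type(restores):
    """restores: list of (outage_type, restore_hours) -> {outage_type: mean hours}"""
    grouped = defaultdict(list)
    for outage_type, hours in restores:
        grouped[outage_type].append(hours)
    return {t: mean(h) for t, h in grouped.items()}

def mttr_delta(baseline, prestaged):
    """Hours saved per outage type, for types that appear in both storms."""
    before, after = mttr_by_type(baseline), mttr_by_type(prestaged)
    return {t: round(before[t] - after[t], 2) for t in before if t in after}

prior_storm = [("spot", 8.0), ("spot", 9.0), ("cascade", 12.0)]
staged_storm = [("spot", 3.0), ("spot", 4.0), ("cascade", 7.5), ("latent", 6.0)]
saved = mttr_delta(prior_storm, staged_storm)
```

Outage types with no prior-storm data, like the latent failure here, are skipped rather than guessed at. One storm is a small sample, so treat the deltas as a direction, not a guarantee.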

05 Run a blameless postmortem, every time

After every significant storm event, block three hours for a structured postmortem using blameless format. What happened, when, why did detection lag, what slowed the first restore, what would better tooling have changed. Get this practice established before you automate anything. The organizational habit compounds over years into real operational improvement. Tooling without the habit is just expensive software nobody trusts.

Non-negotiable rule: No names in the postmortem write-up. You're looking for system failures, process gaps, and missing runbooks - not people to blame. The same system in different hands produces the same outcome.

Terms worth knowing

If some of this was unfamiliar, here's a quick reference - useful if you're sharing this with operations leadership or regulators who need to understand why this work matters before they fund it.

Quick reference

SAIDI: System Average Interruption Duration Index. Total customer outage minutes divided by total customers served. The primary SLO for utility reliability - lower is better.
SAIFI: System Average Interruption Frequency Index. Average number of interruptions per customer per year. Tracks frequency, not duration.
OMS: Outage Management System. The core platform utilities use to track and manage active outages. This is where your historical failure data lives.
SCADA: Supervisory Control and Data Acquisition. The real-time telemetry system for grid equipment - voltage, load, switch states. The equivalent of your infrastructure monitoring platform.
Mutual aid: Pre-negotiated agreements between utilities to share crews and equipment during major events. The equivalent of cloud burst capacity - but you have to call it in advance to get value from it.
D.U.R.E.S.S.: Duration, Utilization, Rate, Errors, Saturation, System health. An observability framework originally built for distributed software systems, applied here to grid operations monitoring.
MTTR: Mean Time To Restore (also expanded as Mean Time To Recovery). How long it takes to restore service after a failure. In grid terms, the average restoration time per outage event - a direct measure of operational efficiency.
Cascade failure: A failure in one component that causes adjacent components to fail under redistributed load. The most damaging outage type and the most predictable if you're watching load utilization in real time.
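For readers who want SAIDI and SAIFI as arithmetic, here is a small sketch following the definitions above. The interruption records and the customer count are invented.

```python
def saidi_saifi(interruptions, customers_served):
    """Compute SAIDI (minutes per customer) and SAIFI (interruptions per customer).

    interruptions: list of (customers_affected, duration_minutes)
    """
    customer_minutes = sum(n * minutes for n, minutes in interruptions)
    customer_interruptions = sum(n for n, _ in interruptions)
    return (customer_minutes / customers_served,
            customer_interruptions / customers_served)

# Three hypothetical interruptions on a system serving 50,000 customers.
saidi, saifi = saidi_saifi(
    [(1000, 120), (250, 480), (4000, 30)],
    customers_served=50_000,
)
```

The three events contribute the same customer-minutes to SAIDI even though they look very different on the ground. That is why the two indices are tracked together.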
