When the grid goes dark!
A conversation with an electrician friend turned into a question neither of us could shake: if crew dispatch is this chaotic at a small contracting company, what does it look like for a utility during a Midwest ice storm? Turns out the answer is: not great. The fix, on the other hand, is just good engineering IMO: the kind that architects think about before anyone writes a line of code.
It started over a beer
A friend of mine owns an electrical contracting company. Smart guy, runs a tight crew, good business. We were sitting around talking shop and he started walking me through what emergency dispatch actually looks like for him - generator failures, storm callouts, middle-of-the-night outages. The whole operation runs on gut feel, phone calls, and whoever picks up. He knows from experience which of his guys is fastest on transformer work. He knows which neighborhoods have the worst access during rain. It all lives in his head.
About halfway through the conversation, we both said the same thing at the same time: if this is what it looks like for a small contractor with a handful of crews, what does it look like for a utility company managing 30 simultaneous outages in the middle of a February ice storm?
We already knew the answer. It looks like the same thing, just with more zeros, more people, higher stakes, and the same institutional knowledge locked in someone's head - except that person is now running a storm bridge call with 40 supervisors asking for updates.
The actual problem: Electrical utilities have decades of outage data, thousands of sensors, and experienced field crews. What most of them don't have is a systematic way to use any of that before a storm hits. They respond. They don't predict. And the difference between those two modes, when you're talking about 8-hour restoration windows and $150 billion in annual US economic losses from power outages, is enormous.
I've spent 10+ years designing and operating large-scale systems across industries. The failure patterns in a stressed electrical grid and a stressed software platform are nearly identical. Same root causes, same organizational behavior under pressure, same failure to use data that's already sitting there. Good architecture fixes both. The instinct is the same whether you're designing a distributed system or a dispatch model for 400 field crews.
Failure patterns: they're not random
The first thing any good engineering team does after a bad incident is ask: has this happened before? Almost always the answer is yes. Grid failures work the same way: the same 15 to 20 percent of assets account for 70 to 80 percent of storm-related outages, every single year. That's not bad luck. That's a data problem masquerading as an operations problem.
Pull five years of outage management system (OMS) data, join it with NOAA storm track records, and map failure frequency by geographic segment and equipment type. You'll have your Pareto list within a week. This isn't exotic analysis. It's the same pattern matching any architect uses when profiling a system for bottlenecks.
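Here's a minimal sketch of that Pareto pass, assuming the OMS export has already been reduced to one asset ID per storm outage. The function name `pareto_assets`, the asset IDs, and the 80 percent coverage default are all placeholders of mine, not anything from a real OMS.

```python
from collections import Counter

def pareto_assets(outage_asset_ids, coverage=0.8):
    """Return the smallest most-failing-first list of assets whose outages
    add up to at least `coverage` of all recorded outages."""
    counts = Counter(outage_asset_ids)
    total = sum(counts.values())
    selected, running = [], 0
    for asset_id, n in counts.most_common():
        if running >= coverage * total:
            break
        selected.append(asset_id)
        running += n
    return selected

# Hypothetical records: one asset ID per storm-related outage event.
records = ["F-12"] * 9 + ["T-07"] * 8 + ["F-03", "P-44", "T-19"]
worst = pareto_assets(records)
print(worst)  # two of the five assets cover 17 of the 20 outages
```

In practice the join against storm tracks happens before this step. The point is only that the ranking itself is a few lines once the data is in one place.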
The failures themselves cluster into three archetypes. Each one behaves differently, costs differently, and requires a different response.
One feeder trips, downstream load redistributes instantly. Adjacent feeder tips over the edge within minutes. This is the one that hurts most: a single cascade during a storm event can account for 60 to 70 percent of total customer outage time.
An isolated transformer or distribution pole goes down. Clean blast radius, limited spread. These are the most common and the most manageable - if you have good crew routing and parts pre-staged nearby.
Equipment that was already failing on its own timeline finally gives out under storm stress. No visible external damage. No obvious cause. Just an aging transformer that picked the worst possible night to stop working.
Cascade failures in a grid look almost identical to cascading failures in distributed systems - one overloaded component starts dropping requests, others pile on, the whole system tips over. The fix in software is circuit breakers and load shedding. The fix in a grid is topology-aware switching and pre-planned load redistribution. Same concept, different medium. This is what thinking like an architect buys you - pattern recognition that crosses domains.
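To make the cascade mechanics concrete, here's a toy simulation. It is deliberately crude and not a power-flow model: it assumes a tripped feeder's load splits evenly across its in-service neighbors, and that any feeder pushed over capacity trips next. The feeder names, loads, and the `simulate_cascade` function are all hypothetical.

```python
def simulate_cascade(load, capacity, neighbors, first_trip):
    """Return feeders in the order they trip under a toy redistribution rule."""
    load = dict(load)  # work on a copy so the caller's data is untouched
    tripped, queue = [], [first_trip]
    while queue:
        feeder = queue.pop(0)
        if feeder in tripped:
            continue
        tripped.append(feeder)
        alive = [n for n in neighbors[feeder] if n not in tripped]
        if not alive:
            continue
        share = load[feeder] / len(alive)  # even split is an assumption
        load[feeder] = 0.0
        for n in alive:
            load[n] += share
            if load[n] > capacity[n] and n not in queue:
                queue.append(n)
    return tripped

capacity = {"A": 10.0, "B": 10.0, "C": 10.0}
neighbors = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}

stressed = {"A": 6.0, "B": 8.7, "C": 5.0}   # B is already at 87% of capacity
relieved = {"A": 6.0, "B": 6.0, "C": 5.0}   # load moved off B ahead of the storm

print(simulate_cascade(stressed, capacity, neighbors, "A"))
print(simulate_cascade(relieved, capacity, neighbors, "A"))
```

With B running hot, losing A takes out all three feeders. With headroom on B, the same trip stays a single-feeder event. That's the whole argument for pre-planned load redistribution in about twenty lines.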
Prediction over reaction: the cost math
Most utilities today operate in fully reactive mode. They know something failed when a breaker trips or a customer calls. The shift to predictive costs some upfront instrumentation investment. What it saves on the back end is significant. Here's a realistic look at how a single un-predicted cascade failure costs out over a storm season versus one where the feeder was flagged and pre-staged before the storm arrived.
The math: 200 restore events. 5.3 hours of avoidable outage each. That's 1,060 hours of customer impact per storm season that better pre-positioning eliminates. The instrumentation pays for itself in the first event.
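The arithmetic is simple enough to keep as a one-liner you can rerun with your own numbers. The function name is mine; the inputs are the figures quoted above.

```python
def avoidable_outage_hours(restore_events, avoidable_hours_per_event):
    """Total avoidable outage hours for a storm season."""
    return restore_events * avoidable_hours_per_event

season_total = avoidable_outage_hours(200, 5.3)
print(f"{season_total:,.0f} avoidable outage hours per season")
```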
What signals actually matter
Not all sensor data is equal. After mapping what actually predicts failures versus what just adds noise, the highest-signal inputs are these, in order of operational usefulness:
How close to the edge a circuit is before storm stress lands. A feeder running at 87% capacity has very little headroom when the next feeder trips and redistributes load onto it.
Thermal runaway risk on aging equipment. A transformer running hotter than its baseline under the same load is telling you it's degrading. It's the equivalent of elevated p99 latency on a service that's about to fall over.
Meter pings going dark in sequence is the earliest real-time indicator of a cascade in progress. When you see 40 meters drop in the same 3-minute window along a feeder path, you know the direction of spread before the field crew does.
Weather forecast data for the storm track. Combined with span length data, it lets you model line sag and contact risk for specific segments before the storm arrives. This is where weather model integration earns its keep.
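The meter-ping signal is the easiest of these to sketch. Below is a sliding-window check, assuming you already have the timestamps (in seconds) at which meters along one feeder path stopped responding. The 40-meter and 3-minute defaults come from the example above; the function name and input format are assumptions.

```python
from collections import deque

def detect_meter_dropout(drop_times_s, window_s=180, threshold=40):
    """Return the time at which `threshold` meters have gone dark within
    any `window_s`-second window, or None if that never happens."""
    recent = deque()
    for t in sorted(drop_times_s):
        recent.append(t)
        # Discard drops that have fallen out of the window ending at t.
        while t - recent[0] > window_s:
            recent.popleft()
        if len(recent) >= threshold:
            return t
    return None

fast_drops = [i * 4 for i in range(40)]    # one meter every 4 seconds
slow_drops = [i * 10 for i in range(40)]   # one meter every 10 seconds

print(detect_meter_dropout(fast_drops))
print(detect_meter_dropout(slow_drops))
```

The fast sequence trips the alarm as soon as the 40th meter goes dark. The slow one never has more than 19 drops inside a single window, so it stays quiet.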
Crew dispatch: the gut-feel problem
This is the part of the conversation where my electrician friend and I really connected. His dispatch system is a group text and 15 years of knowing his guys. It works because his company has 8 crews and he has mental context on all of them. Scale that to a utility with 400 crews, mutual aid from three neighboring states, 300 active outages, road closures, and a storm still moving through, and that mental model collapses completely.
At its core, crew dispatch is a vehicle routing problem. You have crews (vehicles), outage locations (stops), restoration time estimates (service times), and priority tiers (weights). Logistics companies solve harder versions of this in real time. Utilities can too. The only missing ingredient is the will to instrument it properly.
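A real vehicle routing solver is well beyond a blog snippet, so treat the following as a greedy stand-in rather than an optimizer: most urgent outage first, nearest free crew wins. Coordinates are flat (x, y) pairs, lower tier numbers mean higher priority, and every name here is hypothetical.

```python
import math

def assign_crews(crews, outages):
    """Greedy heuristic, not an optimal VRP solution.

    crews:   {crew_id: (x, y)}
    outages: list of (outage_id, tier, (x, y)); lower tier = more urgent.
    Returns {outage_id: crew_id} for the outages that got a crew.
    """
    free = dict(crews)
    plan = {}
    for outage_id, _tier, location in sorted(outages, key=lambda o: o[1]):
        if not free:
            break  # remaining outages wait in the queue
        nearest = min(free, key=lambda c: math.dist(free[c], location))
        plan[outage_id] = nearest
        del free[nearest]
    return plan

crews = {"crew-1": (0, 0), "crew-2": (10, 0)}
outages = [
    ("pole-down", 3, (1, 0)),
    ("hospital", 1, (2, 0)),
    ("substation", 2, (9, 0)),
]
plan = assign_crews(crews, outages)
print(plan)
```

Note that the pole job is closest to crew-1 but still waits, because the hospital outranks it. That's the behavior gut-feel dispatch tends to get wrong at 3am.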
Priority tiers that actually hold up under pressure
- Hospitals and dialysis centers
- Assisted living facilities
- Medical baseline customers
- Traffic signal cluster outages
- Water treatment and pumping
- Transmission faults affecting 1,000+ customers
- Substation-level failures
- Active cascade - isolate first
- Commercial districts, economic exposure
- Spot outages in same corridor
- Standard transformer failures
- Distribution pole damage, accessible roads
- Single-customer, no safety exposure
- Equipment damage with no active outage
- Non-storm backlog
Dispatch optimization isn't a one-time decision at the start of a storm event. It has to re-optimize continuously as new outages come in, roads close, and crews finish jobs. A plan locked at 7pm when the storm hits is stale by 9pm. The model should rerun every 15 to 20 minutes, the same way any dynamic routing system reweights based on current conditions. Static plans fail dynamic storms.
Crew fatigue state matters too. After 14 hours, error rates climb and safety risk goes up. A dispatch system that doesn't track time-on-task isn't really a dispatch system - it's a very expensive call log.
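Both rules are cheap to encode. This sketch assumes the dispatch system already tracks hours on task per crew; the `CrewStatus` type and function names are made up for illustration, while the 14-hour limit and 15-minute cycle are the figures from the text.

```python
from dataclasses import dataclass

@dataclass
class CrewStatus:
    crew_id: str
    hours_on_task: float

def dispatchable(crews, max_hours_on_task=14.0):
    """IDs of crews still under the fatigue limit, freshest first."""
    rested = [c for c in crews if c.hours_on_task < max_hours_on_task]
    return [c.crew_id for c in sorted(rested, key=lambda c: c.hours_on_task)]

def replan_due(minutes_since_last_plan, interval_minutes=15):
    """True once the current plan is older than the re-optimization interval."""
    return minutes_since_last_plan >= interval_minutes

crews = [
    CrewStatus("crew-1", 15.5),
    CrewStatus("crew-2", 6.0),
    CrewStatus("crew-3", 13.0),
]
print(dispatchable(crews))
print(replan_due(20))
```

Crew-1 drops out of the pool at 15.5 hours no matter how close it is to the next job.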
Pre-positioning: the highest-leverage move nobody does
If I had to pick one thing that would have the largest impact on utility storm response, this is it. Pre-positioning means staging crews, parts, and mobile command near predicted failure zones before the storm makes landfall. Not waiting for outages to happen and then driving across a storm-damaged region to respond.
72 hours out - run the model
Overlay your historical failure frequency map on the weather forecast. Flag the top 20 percent of at-risk segments for the storm track. Brief mutual aid partners on expected call volume and crew type mix needed. At this stage you're not moving anything yet - you're building the picture and aligning on the plan.
48 hours out - stage resources
Move mobile command units, transformer stock, and pole inventory to forward staging locations. Pre-position 30 to 40 percent of field crews within 15 miles of the predicted high-impact zone. This feels aggressive if you've never done it before. It feels obvious after the first time it saves you 4 hours per restore on 200 events.
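If you want to check the 30 to 40 percent target rather than eyeball it, a distance filter is enough. The sketch below uses the standard haversine great-circle formula. The coordinates are invented, and straight-line miles are an assumption that ignores the road network.

```python
import math

EARTH_RADIUS_MILES = 3958.8

def miles_between(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def staged_fraction(crew_positions, zone_center, radius_miles=15.0):
    """Share of crews currently within `radius_miles` of the predicted zone."""
    near = sum(1 for p in crew_positions
               if miles_between(p, zone_center) <= radius_miles)
    return near / len(crew_positions)

zone = (41.0, -88.0)  # hypothetical predicted high-impact zone
crew_positions = [
    (41.05, -88.0), (40.9, -88.0),                # staged forward
    (42.0, -88.0), (41.0, -89.0), (41.5, -88.5),  # still at home yards
]
fraction = staged_fraction(crew_positions, zone)
print(f"{fraction:.0%} of crews within 15 miles")
```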
24 hours out - lock the plan
Confirm all road access routes - and identify alternates for the ones that typically flood or get debris-blocked in your region. Brief all crews on the priority tier system. Set the storm bridge call cadence. Make sure every dispatcher has a printed copy of the dispatch priority tiers, because systems go down during storms too.
During the event - re-optimize continuously
Run your dispatch model on a 15 to 20 minute cycle. Track crew locations and time-on-task in real time. When a new high-priority outage comes in, re-evaluate whether the nearest available crew changes. Keep mutual aid crews paired with a local liaison who knows the road network - mutual aid crews are fast once deployed, but they lose time without local knowledge.
Post-event - run the postmortem
Not optional, not optional when you're tired, not optional when the next system is already forming in the Gulf. Schedule it before the season ends. What did the model fail to predict? Which dispatch decisions were wrong in retrospect? What parts ran out first? Where did road access assumptions fail? This learning loop is what turns a good storm response into a great one three years from now.
The D.U.R.E.S.S. framework applied to the grid
I built the D.U.R.E.S.S. framework for distributed systems observability: Duration, Utilization, Rate, Errors, Saturation, System health. It maps to electrical grid monitoring almost perfectly - because a grid is a distributed system. Same topology, same failure modes, same need for a single pane of glass that answers "are we okay right now." This is what architect thinking looks like applied outside of software.
| Dimension | In software | In the grid |
|---|---|---|
| Duration | Request latency, p99 response time | Mean time to restore per outage type and region. If your MTTR is climbing mid-storm, you have a dispatch or parts problem, not a crew problem. |
| Utilization | CPU, memory, disk percentage | Feeder load as a percentage of rated capacity. A feeder at 92 percent load has no headroom for redistribution when the next one trips. |
| Rate | Requests per second, throughput | Active outage creation rate versus restore rate. If outages are opening faster than they're closing, you are falling behind and need to call for more resources now, not in an hour. |
| Errors | HTTP 5xx rate, exception rate | Fault trip rate per segment, failed switching operations. A feeder tripping repeatedly after reset is telling you the fault isn't cleared - don't keep resetting it. |
| Saturation | Queue depth, connection pool exhaustion | Dispatch queue depth versus available crews. Transformer stock versus outstanding transformer jobs. When either of these hits zero, your restore rate flatlines. |
| System health | Synthetic checks, SLO burn rate | SAIDI and SAIFI against your seasonal budget. If you're burning SAIDI faster than your model predicted, escalate mutual aid now - not at end of quarter when the numbers look bad on a report. |
If a utility ops center built a real-time dashboard mapping these six dimensions - restore duration, feeder utilization, outage creation vs. restore rate, active fault rate, crew queue and parts saturation, and SAIDI burn - they would have more operational clarity during a storm than most of them have ever had. This is not exotic technology. It's a solid data pipeline from SCADA and OMS, sensible threshold definitions, and a team willing to own the dashboard. The hard part is organizational will, not the engineering.
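As a rough sketch of what "sensible threshold definitions" could look like, here's one check per dimension against a single snapshot. The field names and the 90 percent load threshold are assumptions of mine; a real ops center would tune these and feed them from SCADA and OMS.

```python
def duress_alerts(s):
    """Return the D.U.R.E.S.S. dimensions that need attention for snapshot `s`."""
    checks = {
        "duration": s["mttr_hours_now"] > s["mttr_hours_prev"],
        "utilization": s["max_feeder_load_pct"] >= 90,  # assumed threshold
        "rate": s["outages_opened_per_hr"] > s["outages_restored_per_hr"],
        "errors": s["repeat_trips_after_reset"] > 0,
        "saturation": s["available_crews"] == 0 or s["transformer_stock"] == 0,
        "system_health": s["saidi_minutes_actual"] > s["saidi_minutes_forecast"],
    }
    return [name for name, firing in checks.items() if firing]

# Hypothetical mid-storm snapshot.
snapshot = {
    "mttr_hours_now": 6.5, "mttr_hours_prev": 5.0,
    "max_feeder_load_pct": 92,
    "outages_opened_per_hr": 30, "outages_restored_per_hr": 22,
    "repeat_trips_after_reset": 0,
    "available_crews": 4, "transformer_stock": 12,
    "saidi_minutes_actual": 41.0, "saidi_minutes_forecast": 55.0,
}
alerts = duress_alerts(snapshot)
print(alerts)
```

Here three dimensions fire at once: restores are slowing, a feeder is out of headroom, and outages are opening faster than they close. That combination is the cue to call for more resources now.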
Where to actually start
The answer is not "buy a new platform." The answer is: start with the data you already have and build the operational discipline first. Every utility already has OMS data, SCADA telemetry, and historical outage records. Most of it is sitting in a database nobody queries except for regulatory reporting.
Mine your outage history
Pull five years of OMS data, join it with weather records, and build your failure frequency map. You'll find your Pareto failure assets within a week. This costs almost nothing and tells you where every dollar of investment and every pre-positioned crew should go. Do this before you buy anything, build anything, or brief anyone.
Build one real runbook
Pick your most common storm failure type - probably spot transformer failure or feeder cascade depending on your region - and write a dispatch runbook. Not a 40-page PDF. A one-page decision tree: if this, then that. Who has authority to call mutual aid. What parts should be on the first truck. How to communicate restoration ETAs to the bridge call.
Instrument your top 20 percent
You don't need real-time telemetry on every asset in your territory. You need it on the assets that fail repeatedly. Take your Pareto list from step one and make sure every item on it has load monitoring, temperature sensing on transformers, and a health check that someone is actually watching. Targeted instrumentation beats broad instrumentation done poorly.
Pre-position once, measure everything
Pick the next predicted significant storm event and run the pre-positioning playbook manually if you have to. Stage two or three crews at forward locations based on your failure map and the weather forecast. Track every restore time, every drive time, every parts-on-truck hit and miss. The data from one event will justify the next level of investment more convincingly than any consultant deck ever could.
Run a blameless postmortem, every time
After every significant storm event, block three hours for a structured postmortem using blameless format. What happened, when, why did detection lag, what slowed the first restore, what would better tooling have changed. Get this practice established before you automate anything. The organizational habit compounds over years into real operational improvement. Tooling without the habit is just expensive software nobody trusts.
Terms worth knowing
If some of this was unfamiliar, here's a quick reference - useful if you're sharing this with operations leadership or regulators who need to understand why this work matters before they fund it.
- SAIDI: System Average Interruption Duration Index. Total customer outage minutes divided by total customers served. The primary SLO for utility reliability - lower is better.
- SAIFI: System Average Interruption Frequency Index. Average number of interruptions per customer per year. Tracks frequency, not duration.
- OMS: Outage Management System. The core platform utilities use to track and manage active outages. This is where your historical failure data lives.
- SCADA: Supervisory Control and Data Acquisition. The real-time telemetry system for grid equipment - voltage, load, switch states. The equivalent of your infrastructure monitoring platform.
- Mutual aid: Pre-negotiated agreements between utilities to share crews and equipment during major events. The equivalent of cloud burst capacity - but you have to call it in advance to get value from it.
- D.U.R.E.S.S.: Duration, Utilization, Rate, Errors, Saturation, System health. An observability framework originally built for distributed software systems, applied here to grid operations monitoring.
- MTTR: Mean Time To Recovery. How long it takes to restore service after a failure. In grid terms, the average restoration time per outage event - a direct measure of operational efficiency.
- Cascade failure: A failure in one component that causes adjacent components to fail under redistributed load. The most damaging outage type and the most predictable if you're watching load utilization in real time.
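For anyone who wants to see SAIDI and SAIFI as arithmetic, here's a small sketch following the definitions above. It assumes each interruption is recorded as a (customers affected, duration in minutes) pair; the function names and sample numbers are illustrative only.

```python
def saidi(interruptions, customers_served):
    """Total customer outage minutes divided by total customers served."""
    customer_minutes = sum(n * minutes for n, minutes in interruptions)
    return customer_minutes / customers_served

def saifi(interruptions, customers_served):
    """Total customer interruptions divided by total customers served."""
    return sum(n for n, _ in interruptions) / customers_served

# Hypothetical reporting period: (customers affected, duration in minutes).
events = [(1000, 120), (500, 60)]
print(saidi(events, 10_000))  # minutes per customer served
print(saifi(events, 10_000))  # interruptions per customer served
```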