Insights
Articles on reliability, leadership, and platform engineering. Lessons from the field and late-night debugging sessions.
Interactive Demos
Interactive simulations to practice incident response and observability skills.
SRE Dashboard
Experience a realistic 2AM incident response: golden signals, distributed traces, log correlation.
P1 Incident Simulator
Black Friday checkout outage scenario. Make decisions, get scored on your SRE skills.
SRE Learning Hub
Interactive modules on observability and resilience. Percentiles, SLOs, chaos engineering, and more.
Core Insights
The context tax, and how to stop paying it
Every AI feature you ship runs a meter you can't see on the dashboard. It shows up as a token bill that climbs faster than usage and answers that quietly get worse. The reflex is to blame the model and swap in a cheaper one. That fixes nothing. The charge comes from the harness you built around it, collected on every turn. Here's where it comes from and how to stop feeding it.
Every new hire goes through a scavenger hunt. I found an alternate way and won at it.
Most engineering onboarding fails not because the documentation is bad but because documentation was never the right tool for the job. Here's the architecture that retrieves the connected trail of decisions instead.
We have 400 dashboards and still don't know if we're healthy
Somewhere between 'we have no visibility' and 'we have too much data to make sense of,' something went wrong. This is about what that something is, and how the signals your systems emit can either answer the health question or make it impossible to answer and drive you bonkers during a 2AM page.
Burning tokens or building outcomes?
Companies are buying AI tools for coding, operations, support, and data analysis. The real question is whether the token spend is turning into better work, measurable outcomes, and OKRs the business can defend.
Building AI you can actually trust
Guardrails, secure code, environment segregation, and the review patterns we built for a healthcare AI platform under HIPAA and SOC 2 scrutiny. No fluff. Just what we actually shipped.
How AI Actually Works: Claude, ChatGPT and LLMs Explained Simply
No jargon, no hype. This guide explains how ChatGPT and Claude actually work under the hood: how they read your question, find context, call tools, remember things, and write an answer. Covers AI safety, guardrails, real risks, and why the doomsday headlines miss the point. Written for anyone, not just engineers.
Agentic AI for the enterprise
The term AI agent gets used for everything from a chatbot with one tool call to a system autonomously managing production infrastructure across a dozen APIs. That gap matters. This is a practitioner breakdown of what an agent actually is, how to build one properly, and what it looks like when you deploy the pattern in healthcare IT and banking operations today.
Claude skills, MCP, and knowing which one to use.
I use Claude constantly. At some point the repeated instructions piling up in every chat become a productivity tax. Skills handle how Claude thinks and executes. MCP handles what Claude can see and touch. Get that separation right from the start and everything else follows cleanly.
When the Grid Goes Dark
If crew dispatch is this chaotic at a small contracting company, what does it look like for a utility during a Midwest ice storm? The fix is just good engineering — the kind that architects think about before anyone writes a line of code.
Observe for Observability
Most observability platforms give you logs, metrics, and traces in separate tabs and call it a single pane of glass. Observe actually connects them. A practitioner breakdown of how Observe handles the three pillars, what the knowledge graph model means in practice, and how it compares to New Relic.