Claude skills, MCP, and knowing which one to use.
I use Claude constantly for writing insights like this, writing code, for incident analysis, for drafting client deliverables. At some point the repeat instructions piling up in every chat become a productivity tax, and you start looking at the tooling Anthropic actually built to solve that. That is where skills and MCP come in, and where most people get confused about which one to reach for.
Most people building Claude automations run into the same wall: they start with skills, realize they need live data, bolt on an MCP server, and then wonder why the whole thing is fragile and they blow through a million tokens in a short period. The answer is not that skills or MCP are wrong. It is that they solve different problems, and conflating them creates a mess that is hard to debug and harder to maintain.
This is the mental model I use when scoping out agentic AI work, both for internal tooling and for client engagements. Skills handle how Claude thinks and executes. MCP handles what Claude can see and touch. Get that separation right from the start and everything else follows cleanly.
The three primitives
Anthropic gives you three distinct tools for extending Claude's capability. Each occupies a different layer of the stack.
These three layers compose, they do not compete. A well-built agentic system will often use all three: projects hold background context, skills define the execution logic, MCP provides live data access. The mistake is treating them as interchangeable.
Skills in depth
A skill is not a prompt. It is a file-system artifact: a folder containing a SKILL.md
file and optionally a references/ directory and a scripts/
directory. Drop the folder into ~/.claude/skills/ and Claude discovers it
automatically. No configuration file. No registration step.
# Minimal skill
csv-cleaner/
SKILL.md
# Full skill with references and scripts
incident-reporter/
SKILL.md
references/
severity-guide.md
escalation-matrix.md
scripts/
parse-logs.py
calculate-mttr.py
The YAML frontmatter block at the top of SKILL.md controls when a skill activates. Claude scans all available skills and scores each description against your prompt. The highest-scoring skill fires. If nothing clears the threshold, Claude responds normally. This means your description is not documentation. It is the activation mechanism, and it needs to be explicit to the point of being almost embarrassingly specific about trigger phrases.
---
name: incident-reporter
description: |
Generates structured incident reports from raw on-call logs, alert
data, or freeform incident notes. Use this skill whenever the user
says "write up the incident", "create a postmortem", "draft the
incident report", or provides a raw timeline and asks for a
structured output. Also activates when the user pastes alert
data and asks for a summary.
Do NOT use for: general log analysis, SLO reviews, or capacity
planning tasks — those have separate skills.
---
## Overview
This skill transforms raw incident data into a structured postmortem
following the DURESS framework...
The description field is the single most load-bearing line in the whole file. If it is vague, the skill never fires. If it is too broad, it hijacks unrelated conversations. The sweet spot is an aggressively specific list of trigger phrases paired with explicit negative boundaries: what the skill should NOT activate for.
When to add scripts
Plain English instructions handle judgment, formatting, and language tasks cleanly. But some tasks
require actual computation: calculating running averages, parsing XML, resizing images in a batch. That
is when you add a scripts/ directory.
The division is clean in practice: scripts handle computation, instructions handle judgment. Your SKILL.md tells Claude exactly which script to run, with what arguments, and how to interpret the output. The script does the math. Claude does the writing.
Every script should accept input/output file paths as command-line arguments and never hardcode paths. Include a comment block at the top explaining arguments, expected output, and possible error conditions. Claude reads that documentation to communicate failures back to the user.
MCP in depth
MCP (Model Context Protocol) is a connectivity layer. It has nothing to do with how Claude executes a task. It is purely about what external systems Claude can interact with during that execution. Think of it as giving Claude a set of tools: a calendar tool, a database query tool, an email send tool. Skills tell Claude what to do; MCP tools give Claude the ability to do it.
MCP servers expose a defined set of tools (list_events, create_task, query_db, send_email) and Claude decides when and how to call them. You write an MCP server once and any agent or skill can use it. That is the key architectural advantage at scale: MCP tools are reusable across skills. Your New Relic MCP server that queries alert data can be used by your incident reporter skill, your SLO review skill, and your capacity planning skill without duplicating a line of code.
Running your own MCP server
MCP servers are lightweight HTTP or stdio processes. You can build one in Python or TypeScript in an afternoon. Each tool you expose follows the same pattern: a name, a description Claude uses to decide when to call it, an input schema, and the implementation.
from mcp.server import Server
from mcp.types import Tool, TextContent
import httpx
app = Server("newrelic-mcp")
@app.list_tools()
async def list_tools():
return [
Tool(
name="query_alerts",
description="Query open alerts from New Relic for a given service",
inputSchema={
"type": "object",
"properties": {
"service_name": {"type": "string"},
"severity": {"type": "string", "enum": ["critical", "warning"]}
}
}
)
]
@app.call_tool()
async def call_tool(name, arguments):
if name == "query_alerts":
# hit New Relic API, return results
data = await fetch_newrelic_alerts(**arguments)
return [TextContent(type="text", text=str(data))]
Keep each MCP server scoped to a single system or domain: one server for New Relic, one for PagerDuty, one for your internal CMDB. Monolithic MCP servers that talk to five different systems are hard to version, hard to secure, and hard to debug when something breaks.
Skills vs MCP: the honest decision matrix
The question I get most often is: should I build this as a skill or as an MCP server? The answer depends on what the task actually needs.
- You are codifying a repeatable workflow that Claude already has the data for
- The task is primarily about judgment, formatting, or language transformation
- You want Claude to follow a precise step-by-step process every time
- The input arrives in the conversation: pasted text, uploaded file, typed request
- You are automating something you currently type out as instructions 3+ times a week
- Claude needs to read or write to a live external system during the task
- The data changes between sessions and cannot be pasted in each time
- Multiple different skills or agents will need access to the same external system
- You need Claude to take actions in the world: create tickets, send messages, update records
- You are building a tool that other developers will also need to use
- The task requires live data and structured execution logic
- Example: an incident reporter skill that queries alert data via New Relic MCP
- Example: a capacity planning skill that reads current utilization via a metrics MCP
- Example: an SLO review skill that pulls current burn rates before generating the report
- The task is genuinely one-off and will not repeat
- The output format changes every time, so just prompt Claude directly
- You are still figuring out the process. Build the skill after you know what you want.
What this actually costs
This is the part people skip until the bill shows up. The architecture choices you make have a direct token cost, and a bloated MCP setup can burn through budget 10x faster than a clean skill-only workflow for the same task.
Here is a realistic comparison for a single incident report generation, using current API pricing (Haiku 4.5 at $1.00/$5.00 per million tokens):
Same output. 10x the cost. At 50 runs a day on a team, that is the difference between $12.50/month and $103.50/month on Haiku alone. Scale that to Sonnet and the numbers hurt more. The discipline of keeping each layer doing only its job is not just architectural cleanliness, it is budget discipline.
Building an MCP server that dumps an entire system's state into context on every call instead of exposing targeted query tools. If your MCP server returns 50k tokens of "here is everything about this service" instead of answering the specific question asked, you are paying for context you do not need. Scope your tools. Return only what the skill actually asks for.
Managing skills at scale
One skill is simple. Ten skills start to conflict. At twenty skills, without deliberate architecture, you will have activation collisions, ambiguous triggers, and skills that fire in the wrong context. This is the multi-skill orchestration problem, and it gets worse the more you build.
Claude's selection mechanism is straightforward: it reads all available skill descriptions, scores each against your request, and activates the highest-scoring match. If two skills have overlapping trigger phrases, the wrong one will sometimes win. If descriptions are too vague, nothing fires. If they are too broad, everything fires.
The most common failure at scale is when two skills share a trigger verb. "Review this" matches both a code review skill and a document review skill. The fix is not to disambiguate the trigger. It is to write the negative boundaries explicitly. Your code review skill should say "Do NOT activate for document, contract, or prose reviews."
Three rules for a conflict-free skill library
Every skill owns a clearly defined domain. The incident reporter handles postmortems. The SLO reviewer handles burn rates and error budgets. The capacity planner handles utilization trends. No overlap. Draw the lines before you write a single SKILL.md.
Every skill's YAML description must explicitly list other skills' territories as exclusions. This is not optional at scale. Your incident reporter should say "Do NOT use for SLO reviews, capacity planning, or alert triage." Your SLO reviewer should say "Do NOT use for postmortems or incident timelines."
If you find yourself using the same trigger phrase for two different skills, one of them has a scope problem. "Write up the incident" should match exactly one skill. "Review our SLO burn" should match exactly one skill. If there is ambiguity, resolve it at the description level rather than hoping Claude picks correctly.
State management across sessions
Context windows are finite. For long-running projects like a multi-week observability rollout, an ongoing audit, or a book written chapter by chapter, Claude forgets what happened in the previous session. The standard workaround is a context log pattern baked into the skill itself.
## Session continuity
At the start of every session, read context-log.md to understand
what was completed last time, what is pending, and what decisions
were made.
At the end of every session, append a summary to context-log.md
with:
- What was completed this session
- What is blocked and why
- What should happen next
- Any decisions made that affect future sessions
This is the shift handover pattern from incident response. The incoming responder reads the chart, knows exactly what happened, and picks up without losing context. Your skill does the same thing. It reads its own notes from the previous session before it does anything else.
MCP at scale: one server per domain
The equivalent management problem for MCP is different. Instead of activation conflicts, you get tool name collisions, authentication sprawl, and versioning headaches. The cleanest architecture keeps MCP servers scoped tightly.
| Scale tier | Skills approach | MCP approach | What breaks without discipline |
|---|---|---|---|
| 1-5 skills | Write as you go, minimal architecture needed | 1-2 MCP servers, probably personal use | Not much. This is the happy path. |
| 5-15 skills | Map territories before writing. Audit descriptions for overlap. | Dedicated server per external system. Shared auth layer. | Activation collisions, tool name conflicts, debug hell |
| 15+ skills | Formal skills registry. Regular conflict audits. Versioning discipline. | API gateway in front of MCP servers. Access control per skill. | Skills library becomes unmaintainable. MCP becomes a sprawl of undocumented endpoints. |
Common mistakes I keep seeing
These are not theoretical. They are patterns I have run into repeatedly, both in my own builds and in client engagements where someone handed me a broken automation to fix.
A skill is for repeatable task execution, not for personality and global instructions. If you are copying your entire system prompt into SKILL.md, you are using the wrong layer. Global context, tone guidelines, and background knowledge about your company go in a Project. The skill handles the specific task.
If the data is static, or changes once a quarter, you do not need an MCP server. Uploading a reference document to a Project costs nothing and works immediately. I have seen teams build and maintain a full MCP server just to serve a severity matrix that gets updated twice a year.
Iterating on a SKILL.md by pasting new versions into the conversation and asking Claude to compare them. Your skills/ directory is code. It goes in git. You get diffs, rollbacks, and a history of what changed and why. Treating it as a chat artifact means you will lose changes, and you will have no idea why a skill started behaving differently last Tuesday.
git init your skills directory on day one. Commit every description change with a note on what you were trying to fix.Building a single MCP server that connects to New Relic, PagerDuty, Slack, your CMDB, and your ticketing system. It feels efficient until you need to update the PagerDuty auth token and the whole server goes down. Or until you want to give one team access to metrics but not ticketing.
Encoding a workflow into a skill before you have run it manually enough times to know all the edge cases. You end up with a skill that works 80% of the time and fails in ways that are hard to diagnose because the instructions do not cover the cases you had not seen yet.
How to know if your skill is actually working
This is the part of the docs nobody writes. You drop a SKILL.md into your skills directory, send a message that should trigger it, and Claude responds. But did the skill fire, or did Claude just happen to do something similar on its own? Here is how to tell, and how to iterate without breaking everything.
Verifying activation
The most direct check: ask Claude explicitly. After any response that should have used a skill, ask "which skill did you use for that?" Claude will tell you the skill name if one activated, or tell you no skill fired. Do this every time you write a new skill until you have confidence in the trigger language.
The second signal is output structure. If your skill specifies a precise output format (specific sections, a particular heading structure, code blocks in a defined order) and the response matches it exactly, the skill fired. If the output is Claude's default style for that kind of task, it probably did not.
When a skill does not fire: first check if your trigger phrase exactly matches something in the description. Then check for competing skills that might be outscoring it. Then broaden the description with more trigger phrase variants. When a skill fires too often: add explicit negative boundaries. Start with the most obvious adjacent tasks it should not handle.
Iterating safely
Never edit a live skill description and immediately test it in a real workflow. Keep a
test-prompts.md file next to each skill with 5-10 example inputs that should trigger it
and 3-5 that should not. Run through that list after every description change before you use it in
production. Takes two minutes. Saves a lot of confusion later.
When you are tuning a description that keeps misfiring, change one thing at a time. Add one trigger phrase, test. Add one negative boundary, test. If you rewrite the whole description at once and something breaks, you will not know which change caused it.
# incident-reporter — test prompts
## Should activate
- "write up last night's incident"
- "create a postmortem for the payment service outage"
- "draft the incident report, here's the timeline: [...]"
- "we had a p1 this morning, can you structure it up"
## Should NOT activate
- "what's our current SLO burn rate"
- "review this runbook and tell me what's missing"
- "analyze these logs for anomalies"
- "can you summarize what happened in standup"
Turning your existing prompts into skills in 15 minutes
If you have been using Claude for a while, you already have skills. They just live in your chat history as repeated instructions you paste every time. Here is how to extract them properly.
Scroll back through your Claude chats. What instruction do you paste most often? The one you have practically memorized because you type it three times a week. That is your first skill. Copy the best version of it.
Make a directory in ~/.claude/skills/ named after the task (use kebab-case,
keep it short). Create a SKILL.md file inside it. That is all you need to
start.
Before you paste in your prompt, write the name and description fields. List every phrase you actually use to kick off this task. Think about what you type, not what sounds official. "clean up these meeting notes" is better than "meeting transcript summarization".
Your existing prompt goes in as the Overview and Instructions sections. Do not rewrite it yet. Get it working first. You can refine later once you know the activation is solid.
Think about the two or three tasks most likely to be confused with this one. Add a "Do NOT use for" section in the description listing them explicitly. This one step prevents most activation problems before they happen.
Send the exact phrase you normally use to kick off the task. Verify the skill fired (ask Claude which skill it used). If it works, commit to git immediately. You now have a versioned, reproducible workflow instead of a prompt in your clipboard.
The whole thing should take 15 minutes for a skill you already know well. The investment pays back on the first time a teammate asks "how do you do that so consistently?" and you can just point them at a file.
The environment split that matters
Skills behave differently depending on where they run, and this affects what you can build.
Claude Code is the command-line interface for developers. Skills here run in your
project directory under .claude/skills/ or globally at ~/.claude/skills/.
They have full access to the file system, bash, and can execute scripts. This is where you build skills
that manipulate files, run Python scripts, or interact with your codebase. MCP servers registered here
have access to local tools, local credentials, and local processes. This is the environment for serious
automation work.
Claude Desktop (CoWork) is the desktop agent for non-developers. Same SKILL.md format, different execution context. Skills here can interact with your screen and applications through the agent's GUI capabilities, but they cannot run arbitrary bash commands. MCP servers in this context tend to be hosted services rather than local processes.
If you are building skills for a team, Claude Code is the right environment. It supports scripting, version control for your skills/ directory, and proper MCP server configuration per project. CoWork is better for non-technical users who need the same skill to work through a GUI without touching a terminal.
What this looks like end to end
Here is a concrete example from an SRE context. We needed an incident reporter that could pull live alert data, structure it, and output a formatted postmortem draft. Three layers:
Upload severity definitions, escalation matrix, and team roster as reference documents in a project. Claude reads these as background knowledge in every session.
Build a New Relic MCP server exposing three tools: query_alerts, get_deployment_history, and fetch_error_rate. Claude calls these during execution to pull current data. No copy-pasting alert payloads into the chat.
Write an incident-reporter SKILL.md that defines the exact postmortem structure, tells Claude when to call each MCP tool, specifies the output format, and handles edge cases like missing data or multi-service incidents. Add a calculate-mttr.py script for the timeline math.
The result: type "write up last night's incident" and Claude pulls the alert data via MCP, calculates the timeline via the script, applies the structure from the skill, and outputs a draft postmortem that follows the team's format. No instructions pasted. No data copied. One command.
Where to start
If you are new to this: start with one skill, no MCP. Find the task you type instructions for most often and encode it. Get comfortable with the SKILL.md format and the activation mechanism. Add MCP when you hit the wall where Claude needs data it cannot see in the conversation.
If you are building for a team: map your skill territories before writing a single file. Define the
negative boundaries. Set up a shared ~/.claude/skills/ directory in version control.
Build your first MCP server scoped to a single external system. Expand from there.
The primitives are simple. The discipline is in knowing which layer solves which problem and not conflating them when you are under pressure to ship.