TechAni

How AI Actually Works: Claude, ChatGPT and LLMs Explained Simply

No jargon, no hype. How these systems read your question, find context, call tools, and write an answer. Plus the real safety picture and why the doomsday headlines miss the point.

Beginner Friendly | AI Systems | March 25, 2026 | ~12 min read | AniBot + Claude

This article started at a barbershop. Small town, Midwest, the kind of place where the conversation covers everything from local news to the end of the world. Someone brought up AI and the room had questions: robots taking over, bots walking the streets, the whole picture painted by whatever came across their feed that week. I pulled out my phone, showed them what I was actually doing with it, and tried to explain how it works in plain English. This is that explanation, written down.

Here is the simple version first. When you ask Claude or ChatGPT something, the system reads your question, breaks it into smaller pieces, checks patterns it learned during training, may pull in outside information or tools, and then writes the answer back one small piece at a time. Modern models also handle images, audio, and video, not just text.

A few quick definitions before we start. LLM stands for large language model. It is software trained on a huge amount of text, images, and other data so it can predict what comes next. A token is a small piece of text. A tool is an extra helper like web search, a calculator, a code runner, or a camera feed.

The sections below give the step-by-step version. If you are a barber in a small Midwest town or someone who just wants to understand what is actually going on, this is written for you.

The basics: how an LLM writes text

Your text gets split into tokens. A token is a small piece of text, not always a full word. "Hello world" might become ["Hello", " world"]. Longer or unusual words often get broken into smaller parts.
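To make the splitting concrete, here is a minimal sketch of a toy tokenizer. The vocabulary is invented for illustration and the greedy longest-match rule is a simplification; real tokenizers use far larger vocabularies and their own merge rules.

```python
# Toy vocabulary, invented for illustration only; real ones hold around 100K entries.
TOY_VOCAB = ["Hello", " world", " un", "believ", "able", " "]

def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match split; unknown text falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Pick the longest vocabulary entry that matches at position i.
        match = max(
            (piece for piece in TOY_VOCAB if text.startswith(piece, i)),
            key=len,
            default=text[i],
        )
        tokens.append(match)
        i += len(match)
    return tokens

print(toy_tokenize("Hello world"))         # ['Hello', ' world']
print(toy_tokenize("Hello unbelievable"))  # ['Hello', ' un', 'believ', 'able']
```

Notice that the space travels with the word that follows it, and that the longer word gets cut into three pieces.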

Each token then gets turned into numbers. In AI, those numbers are often called a vector or an embedding. Think of that as the model's internal way of storing meaning so related words and ideas end up closer together.
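A rough sketch of what "closer together" means. The three-number vectors below are made up for the example; real embeddings are learned during training and have thousands of dimensions. Cosine similarity is one common way to measure closeness.

```python
import math

# Made-up vectors for illustration; real embeddings are learned, not hand-written.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Return a score near 1.0 for vectors pointing the same way, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(TOY_EMBEDDINGS["cat"], TOY_EMBEDDINGS["dog"])
cat_car = cosine_similarity(TOY_EMBEDDINGS["cat"], TOY_EMBEDDINGS["car"])
print(f"cat vs dog: {cat_dog:.2f}")
print(f"cat vs car: {cat_car:.2f}")
```

With these toy numbers, "cat" scores much closer to "dog" than to "car", which is the behavior a trained model ends up with for related ideas.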

Attention is the part that helps the model decide what matters most in a sentence. Each token checks the other tokens around it to figure out which words are important for meaning.

When the model reads "The cat sat on the mat", the word "sat" is linked closely to "cat" and "mat" because those words help explain who did the action and where it happened.

Large models do this many times in parallel using attention heads. An attention head is just one small pattern detector focused on a certain kind of relationship:

  • One head may track who did the action
  • Another may track what the action applies to
  • Another may connect names, pronouns, and references
  • Many others learn patterns around order, grammar, tone, and structure
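For readers who want to see the math, here is a minimal sketch of one attention head using scaled dot-product attention. The two-number vectors are invented, and the same vectors are reused as queries, keys, and values; a real model computes each of those from separate learned weights.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that add up to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How strongly this token's query matches every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using those weights.
        blended = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Invented vectors for three tokens; real models learn these.
tokens = ["cat", "sat", "mat"]
vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
outputs, weights = attention(vectors, vectors, vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one token "looks at" each of the three tokens. A large model runs many heads like this in every layer.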

The last step is prediction. The model scores every possible next token, turns those scores into probabilities (how likely each option is), and picks one.

After it picks the next token, that token gets added to the running answer. Then the model does the same thing again. That loop is why replies appear word by word even though a lot of math is happening underneath.
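The loop can be sketched in a few lines. The score table below is a hypothetical stand-in for a trained model, and it only looks at the last token; a real model looks at everything written so far.

```python
import math
import random

# Hypothetical stand-in for a model: raw scores for what may follow each token.
TOY_SCORES = {
    "<start>": {"The": 3.0, "A": 1.0},
    "The": {"cat": 2.5, "mat": 1.0},
    "A": {"cat": 2.0, "mat": 2.0},
    "cat": {"sat": 3.0, "<end>": 0.5},
    "sat": {"<end>": 2.0},
    "mat": {"<end>": 2.0},
}

def to_probabilities(scores):
    """Softmax: convert raw scores into probabilities that add up to 1."""
    peak = max(scores.values())
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def generate(max_tokens=10, seed=0):
    rng = random.Random(seed)
    answer = ["<start>"]
    for _ in range(max_tokens):
        probs = to_probabilities(TOY_SCORES[answer[-1]])
        # Pick one token, with likelier tokens chosen more often.
        next_token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if next_token == "<end>":
            break
        answer.append(next_token)  # the pick joins the running answer
    return answer[1:]

print(" ".join(generate()))
```

Changing the seed can change the sentence, which is the same reason a chatbot can give slightly different answers to the same question.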

Typical scale, in rough numbers:

  • Transformer layers: 32-80
  • Embedding dimensions: 4096+
  • Training tokens: trillions
  • Training cost: $10M+

Tool calling: how Claude or ChatGPT reaches outside the model

The model by itself does not magically know live prices, your private documents, or what is inside a spreadsheet. For that it uses tools. A tool is a connected helper like web search, a calculator, a code runner, or a company database. You may also hear this called function calling. An API is simply a formal way for one software system to ask another software system for data or an action.
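Here is a minimal sketch of the app side of tool calling. The JSON request format and the tiny calculator are assumptions made up for this example; real APIs from Anthropic and OpenAI define their own structured tool formats.

```python
import json

def calculator(expression: str) -> str:
    """Tiny calculator tool: handles only 'a + b' or 'a * b'."""
    left, op, right = expression.split()
    a, b = float(left), float(right)
    if op == "+":
        return str(a + b)
    if op == "*":
        return str(a * b)
    raise ValueError(f"unsupported operator: {op}")

# The app, not the model, decides which tools exist.
TOOLS = {"calculator": calculator}

def handle_model_output(model_output: str) -> str:
    """If the model asked for a known tool, run it; otherwise pass the text through."""
    try:
        request = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output  # ordinary text answer
    if not isinstance(request, dict) or request.get("tool") not in TOOLS:
        return model_output
    return TOOLS[request["tool"]](**request.get("arguments", {}))

print(handle_model_output('{"tool": "calculator", "arguments": {"expression": "19 * 23"}}'))
print(handle_model_output("Paris is the capital of France."))
```

The key idea: the model only writes a request. The surrounding software runs the tool and hands the result back.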

Example API endpoint: api.anthropic.com/v1/messages

ChatGPT's Tools

  • Web browsing (Bing API)
  • Python execution (Jupyter)
  • DALL-E image generation
  • File upload/download
  • Data analysis (pandas, matplotlib)

Claude's Tools

  • Web search and fetch
  • Bash execution (Linux container)
  • File operations (read/write/edit)
  • Image analysis (native)
  • PDF/document parsing

Search and memory: how RAG finds the right information

When Claude or ChatGPT needs facts from documents, websites, or a company knowledge base, it usually does not rely on memory alone. It searches for relevant text first. One common setup is called RAG, short for Retrieval Augmented Generation. In plain English, that means find useful source material first, then write the answer using that material.
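A small sketch of the "find first, then write" idea. The three documents are invented, and the search here compares simple word counts; production RAG systems usually compare embeddings in a vector database instead.

```python
import math
import re
from collections import Counter

# Invented knowledge base for illustration.
DOCS = [
    "Refunds are allowed within 30 days of delivery for damaged items.",
    "Standard shipping takes 3 to 5 business days.",
    "Gift cards cannot be exchanged for cash.",
]

def bag_of_words(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, docs, top_k=1):
    q = bag_of_words(question)
    ranked = sorted(docs, key=lambda d: similarity(q, bag_of_words(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, docs):
    # Retrieved text is pasted into the prompt ahead of the question.
    context = "\n".join(f"- {chunk}" for chunk in retrieve(question, docs))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"

print(build_prompt("Can I get a refund for a damaged item?", DOCS))
```

The model then answers from the pasted sources, which is why a RAG answer can cite a document the model never saw in training.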

Typical numbers for a RAG setup:

  • Embedding dimensions: 1536
  • Vector search time: <50ms
  • Documents indexed: 100M+
  • Chunks retrieved: 5-10

Code execution: how the AI can test things safely

Some AI products can run code, inspect files, or make charts. They do this inside a sandbox. A sandbox is an isolated workspace that keeps the run separate from the rest of the system so mistakes or risky code do less damage.
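Two of the ideas behind a sandbox, a separate process and a time limit, can be sketched with Python's standard library. This is not a real sandbox: it has no file or network isolation, which is what containers add in production.

```python
import subprocess
import sys

def run_untrusted(code: str, timeout_seconds: float = 5.0) -> dict:
    """Run code in a separate Python process with a time limit.

    Illustration only: there is no file or network isolation here.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # The child process is stopped once the limit passes.
        return {"ok": False, "stdout": "", "error": "timed out"}
    return {
        "ok": completed.returncode == 0,
        "stdout": completed.stdout,
        "error": completed.stderr,
    }

print(run_untrusted("print(2 + 2)"))
print(run_untrusted("import time; time.sleep(30)", timeout_seconds=1.0)["error"])
```

If the code crashes or runs too long, only the child process is affected. The app that launched it keeps running and can report the error back to the model.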

ChatGPT Sandbox

  • Jupyter Python kernel
  • Docker isolated container
  • No internet access
  • 10GB disk space
  • Pre-installed packages (pandas, numpy, matplotlib)
  • 30-120s execution timeout

Claude Sandbox

  • Ubuntu Linux container
  • Full bash shell access
  • Limited internet (approved domains)
  • Package installation (pip, npm, apt)
  • File system read/write
  • 30-120s execution timeout

When you ask an AI product to build software, it does not just guess. It weighs options, checks which tools it has, works through the logic step by step, and then writes and tests the code in the sandbox.

Infrastructure: the systems behind the answer

All of this sits on backend infrastructure, which is just the hidden machinery behind the app. That includes GPUs for the heavy math, search systems for finding context, sandboxes for running code, and monitoring tools that watch speed, failures, and cost. You may hear the word inference here. Inference simply means the model generating an answer. Latency just means how long you wait for that answer.

Typical numbers behind the scenes:

  • GPU clusters: A100/H100
  • End-to-end latency: 3-8s
  • Cost per 1000 requests: $5-50
  • Cache hit rate: 40-60%

Total cost per 1000 requests: $5-50 depending on model choice, tools used, and complexity. That is the math behind flat-rate plans like ChatGPT Plus ($20/month) and Claude Pro ($20/month): the pricing assumes roughly 500-2000 requests a month. At the high end of $50 per 1000 requests, 2000 requests cost about $100, so heavy users cost more than their subscription brings in. Light users subsidize them.
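The subscription math is easy to check. This sketch uses only the ranges quoted above; the function name is just for the example.

```python
def monthly_cost(requests_per_month: int, cost_per_1000: float) -> float:
    """Provider-side cost in dollars for one user's month of requests."""
    return requests_per_month / 1000 * cost_per_1000

SUBSCRIPTION = 20.0  # dollars per month, as quoted in the article

for requests in (500, 2000):
    for cost_per_1000 in (5.0, 50.0):
        cost = monthly_cost(requests, cost_per_1000)
        verdict = "covered" if cost <= SUBSCRIPTION else "costs more than the subscription"
        print(f"{requests} requests at ${cost_per_1000:.0f}/1000 = ${cost:.2f} ({verdict})")
```

The four combinations come out to $2.50, $25, $10, and $100 a month, so the same $20 plan is profitable on some users and a loss on others.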

AI agents: how the system handles multi-step work

Sometimes you do not want one answer. You want the system to handle a job with several steps, like checking an order, applying a policy, sending an email, and updating a record. That is where AI agents come in. An agent is simply a model plus tools plus a step-by-step loop for deciding what to do next.

🧠 Agent (the coordinator)

This is the part that reads the goal, picks the next step, and decides when the job is done. Think of it like a coordinator, not a magic brain.

🔧 Tools (the helpers)

These are the connected helpers the agent can use: search a knowledge base, check a database, send an email, update a system, or write a file. Each tool has rules about what input it expects and what it returns.

🔄 Workflow (the step-by-step loop)

The workflow is the repeatable loop. Read the goal. Pick a step. Use a tool. Look at the result. Decide the next step. Stop when the task is done or hand it to a person if it gets stuck.

Example workflow: agent-platform.company.com/workflows/customer-refund

Goal: "Customer claims their order #8472 arrived damaged. Process refund and send confirmation email."

Available tools:

  • lookup_order(order_id: str): Queries order database. Returns order details, status, items, customer info.
  • check_refund_policy(reason: str, order_date: str): Checks if refund is allowed per company policy. Returns eligibility and conditions.
  • process_refund(order_id: str, amount: float, reason: str): Initiates refund via payment processor. Returns transaction ID and estimated completion.
  • send_email(to: str, template: str, variables: dict): Sends templated email to customer. Returns delivery status.
  • update_crm(customer_id: str, notes: str, tags: list): Updates customer record in CRM. Adds notes and tags for future reference.

Steps:

  1. Agent receives goal and available tools. The model sees the goal, the tool descriptions, and any earlier conversation.
  2. Agent plans the next action. Example thought: "I need order details first". Action: call lookup_order(order_id="8472")
  3. Tool runs and returns a result. Observation: Order found, total $89.99, status "delivered", customer email: (address redacted)
  4. Agent updates plan and continues. It loops back with the new result, then keeps going until the task is done or it needs help.

Error handling:
  • Tool fails: Agent sees error message, tries alternative approach
  • Ambiguous result: Agent asks clarifying question or makes best-effort decision
  • Max iterations hit: Agent summarizes progress, flags for human review
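The whole loop fits in a short sketch. The tools below are fakes that return canned data, the order date is invented, and a scripted function stands in for the model's decision-making, so treat this as an outline of the control flow rather than a working refund system.

```python
# Fake tools returning canned data; real ones would call databases and payment APIs.
def lookup_order(order_id):
    return {"order_id": order_id, "total": 89.99, "status": "delivered"}

def check_refund_policy(reason, order_date):
    return {"eligible": reason == "damaged"}

def process_refund(order_id, amount, reason):
    return {"transaction_id": "demo-txn-1", "amount": amount}

TOOLS = {
    "lookup_order": lookup_order,
    "check_refund_policy": check_refund_policy,
    "process_refund": process_refund,
}

def scripted_planner(history):
    """Stand-in for the model: picks the next step from what has happened so far."""
    results = {step["tool"]: step["result"] for step in history}
    if "lookup_order" not in results:
        return {"tool": "lookup_order", "args": {"order_id": "8472"}}
    if "check_refund_policy" not in results:
        # The order date is invented for this example.
        return {"tool": "check_refund_policy",
                "args": {"reason": "damaged", "order_date": "2026-03-01"}}
    if not results["check_refund_policy"].get("eligible"):
        return None  # nothing more the agent should do
    if "process_refund" not in results:
        amount = results["lookup_order"].get("total", 0.0)
        return {"tool": "process_refund",
                "args": {"order_id": "8472", "amount": amount, "reason": "damaged"}}
    return None  # goal reached

def run_agent(planner, max_steps=5):
    history = []
    for _ in range(max_steps):
        action = planner(history)
        if action is None:
            return {"status": "done", "history": history}
        try:
            result = TOOLS[action["tool"]](**action["args"])
        except Exception as error:
            result = {"error": str(error)}  # the agent sees the error on its next turn
        history.append({"tool": action["tool"], "result": result})
    # Max iterations hit: stop and flag for a person.
    return {"status": "needs_human_review", "history": history}

outcome = run_agent(scripted_planner)
print(outcome["status"], [step["tool"] for step in outcome["history"]])
```

Swap the scripted planner for a call to a real model and you have the basic shape of most agent systems.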

How the system keeps its place

These systems need to remember where they were. Teams save the current state, which means the conversation, tool results, and next step, in a database. If the process stops or times out, it can restart from the last saved point instead of beginning again.
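A minimal sketch of saving and resuming state. It writes a JSON file to a temporary folder to stay self-contained; the paragraph above describes a database, and the file name and field names are assumptions for the example.

```python
import json
import tempfile
from pathlib import Path

def save_state(path: Path, state: dict) -> None:
    # Write to a temporary file, then swap it in, so a crash cannot leave half a file.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)

def load_state(path: Path) -> dict:
    if path.exists():
        return json.loads(path.read_text())
    # No checkpoint yet: start from the beginning.
    return {"messages": [], "tool_results": [], "next_step": 0}

checkpoint = Path(tempfile.mkdtemp()) / "workflow_8472.json"

state = load_state(checkpoint)
state["tool_results"].append({"tool": "lookup_order", "status": "delivered"})
state["next_step"] = 1
save_state(checkpoint, state)

# Later, after a restart, the workflow picks up where it left off.
resumed = load_state(checkpoint)
print(resumed["next_step"])
```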

Where a person still has to approve

For important actions, a human still needs to sign off. The agent prepares the action, pauses, and waits for approval. This is common for money movement, risky customer messages, deletions, or policy exceptions.

Safety rules around tools

Not every tool is safe to let loose automatically. Real systems put clear limits around what the agent can do:

  • Rate limits per tool (max 10 DB queries per workflow)
  • Permission checks (agent can read orders, not delete them)
  • Cost limits (max $5 in API calls per workflow)
  • Audit logs (every tool call logged with timestamp, inputs, outputs)
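Those four limits can be combined in one small wrapper. The class and tool names here are made up for the sketch; the default limits mirror the examples in the list above.

```python
import time

class ToolGuard:
    """Wraps tool calls with permission, rate, and cost checks plus an audit log."""

    def __init__(self, allowed_tools, max_calls_per_tool=10, max_cost=5.0):
        self.allowed_tools = set(allowed_tools)
        self.max_calls_per_tool = max_calls_per_tool
        self.max_cost = max_cost
        self.calls = {}
        self.spent = 0.0
        self.audit_log = []

    def call(self, tool_name, func, cost=0.0, **inputs):
        if tool_name not in self.allowed_tools:
            outcome = "denied: not permitted"
        elif self.calls.get(tool_name, 0) >= self.max_calls_per_tool:
            outcome = "denied: rate limit"
        elif self.spent + cost > self.max_cost:
            outcome = "denied: cost limit"
        else:
            self.calls[tool_name] = self.calls.get(tool_name, 0) + 1
            self.spent += cost
            outcome = func(**inputs)
        # Every attempt is logged, including the denied ones.
        self.audit_log.append(
            {"time": time.time(), "tool": tool_name, "inputs": inputs, "output": outcome}
        )
        return outcome

def read_order(order_id):
    return {"order_id": order_id, "status": "delivered"}

def delete_order(order_id):
    return {"deleted": order_id}

guard = ToolGuard(allowed_tools={"read_order"}, max_calls_per_tool=2)
print(guard.call("read_order", read_order, cost=0.01, order_id="8472"))
print(guard.call("delete_order", delete_order, order_id="8472"))
```

The agent never calls a tool directly. Everything goes through the guard, so the limits hold no matter what the model decides to try.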

How teams watch what the agent is doing

If you cannot see what the agent is doing, you cannot trust it. Teams track simple things like success rate, number of steps, common failures, and cost:

  • Success rate: goal completion %
  • Avg steps: tool calls per workflow
  • Failure modes: where agents get stuck
  • Cost per run: LLM + tools combined

Tracing tools follow each workflow from start to finish. Dashboards show slow steps, repeated loops, and failures so teams can fix the system before it causes trouble.
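As a rough sketch, the four numbers can be computed from a list of run records. The records and field names below are invented; a real system would read them from its tracing store.

```python
from collections import Counter

# Invented run logs for illustration.
runs = [
    {"succeeded": True, "steps": 3, "cost": 0.04, "failed_at": None},
    {"succeeded": True, "steps": 4, "cost": 0.05, "failed_at": None},
    {"succeeded": False, "steps": 8, "cost": 0.09, "failed_at": "process_refund"},
    {"succeeded": True, "steps": 5, "cost": 0.06, "failed_at": None},
]

success_rate = sum(r["succeeded"] for r in runs) / len(runs)
avg_steps = sum(r["steps"] for r in runs) / len(runs)
cost_per_run = sum(r["cost"] for r in runs) / len(runs)
failure_modes = Counter(r["failed_at"] for r in runs if not r["succeeded"])

print(f"Success rate: {success_rate:.0%}")
print(f"Avg steps: {avg_steps:.1f}")
print(f"Cost per run: ${cost_per_run:.2f}")
print(f"Failure modes: {dict(failure_modes)}")
```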

Common software used to build these systems

LangGraph (LangChain)

  • State machines for agent workflows
  • Built-in persistence and checkpointing
  • Python-native, integrates with LangChain tools
  • Good for complex multi-agent systems

CrewAI

  • Multiple agents working together
  • Role-based agents (researcher, writer, reviewer)
  • Sequential and hierarchical workflows
  • Good for content generation pipelines

AutoGen (Microsoft)

  • Conversational agent framework
  • Agents that talk to each other
  • Built-in code execution and debugging
  • Good for research and analysis tasks

Custom (Roll Your Own)

  • Direct LLM API + tool calling
  • Full control over how the workflow runs
  • Minimal dependencies, max flexibility
  • Good for production at scale

Going deeper: how tokenization actually works

The intro above uses "tokens" as shorthand. This section is for people who have already built on top of an LLM API or are writing prompts professionally and want to understand what is actually happening at that layer. The design decisions made here have real consequences for model behavior, cost, and context window math.

The model never sees words. It sees integers. Every piece of input (your query, the system prompt, retrieved context, conversation history) gets converted to a flat sequence of token IDs before a single matrix multiply happens. How that conversion works matters more than most people realize.

A vocabulary is the complete set of units a model can recognize and produce. How you define those units involves a hard tradeoff with no perfect answer.

Word-level

Split on whitespace and punctuation. Intuitive, but vocabulary bloat kills you fast. English surface forms explode once you account for plurals, conjugations, compounds, proper nouns, typos, and domain jargon. Anything outside the fixed vocabulary at inference time becomes an unknown token, breaking the prediction chain. Pre-2018 NLP systems lived and died by this problem.

Character-level

Every Unicode codepoint is its own token. Vocabulary stays tiny, you never hit an unknown token, and you can handle any script. The cost: sequence length explodes. English words average around 4-5 characters plus spaces, so a 512-token word-level context becomes roughly 2,500 tokens at the character level for the same passage. Long-range dependencies across that expanded length are significantly harder for a model to learn.
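A quick way to see the length penalty is to count both ways on one sentence. The sentence is arbitrary, and splitting on spaces is a simplification of real word-level tokenization.

```python
passage = "The cat sat on the mat because the mat was warm"

word_tokens = passage.split()  # word-level: split on whitespace
char_tokens = list(passage)    # character-level: one token per character

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(f"{len(char_tokens) / len(word_tokens):.1f}x longer at the character level")
```

This short sentence comes out a little over four times longer at the character level. Typical prose has longer words than this sentence, which is where the roughly fivefold estimate above comes from.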

Subword (what everything uses today)

High-frequency strings get their own token. Low-frequency strings get decomposed into smaller pieces that individually appear more often. You get vocabulary coverage close to character models without the sequence length penalty. GPT-4, Claude, Llama, Gemini, and every other production LLM you interact with use a subword scheme.

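BPE (byte pair encoding) is a bottom-up compression algorithm adapted for vocabulary construction. The loop is simple in principle: start with individual characters or bytes as the vocabulary, count every adjacent pair of symbols in the training corpus, merge the most frequent pair into a new token, and repeat until the vocabulary reaches its target size.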

The result: common strings like function or return get a single token. Rare words get decomposed into pieces, often slicing right through what a linguist would call morpheme boundaries. That is intentional: BPE optimizes for corpus frequency, not linguistic structure, and those two things do not align.
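Here is a minimal sketch of BPE training on a tiny word list. It works on characters and whole words for readability; production tokenizers typically work on bytes, handle whitespace specially, and run tens of thousands of merges. The word list is a standard teaching example, not real training data.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair with one joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    # Start with every word split into single characters, counted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Merge the most frequent pair into a new token.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[merge_pair(symbols, best)] += freq
        corpus = merged_corpus
    return merges, corpus

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, corpus = train_bpe(words, num_merges=4)
print(merges)
print(dict(corpus))
```

After four merges, the frequent word "low" has become a single token, while "newest" and "widest" share an "est" piece. That is the frequency-over-linguistics behavior described above.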

If you have spent time prompting models at scale, you have probably run into behaviors that trace back to the tokenization layer rather than the model weights.

Character counting failures

Ask a model to count the letter "s" in "Mississippi" and it often gets it wrong. The model does not operate on characters directly. If "iss" merges into a single token, the individual letters inside it are not immediately recoverable from the model's internal representation. It has to reconstruct character-level information from the embedding, which is lossy.

Whitespace-sensitive tokenization

The tokens for "python" and " python" (with a leading space) usually have different IDs and different embeddings. Stripping whitespace from prompts can produce subtly different outputs. This matters if you are doing token-level log probability analysis or building evals that compare raw completions.

Rhyme and phonetics

Models struggle with rhyming more than their language fluency suggests they should. Two words can rhyme while sharing almost no subword tokens. The model has to infer phonetic similarity from spelling patterns, which are encoded unevenly across the vocabulary depending on how each word was tokenized.

Multilingual token inflation

A vocabulary built on an English-heavy corpus tokenizes English efficiently and everything else less so. A 100-token English passage might cost 300-500 tokens in a morphologically rich language with less training representation. Your effective context window shrinks for non-Latin input, and you burn more tokens per API call for the same semantic content.

Each token ID maps to a learned embedding vector. That vector is not engineered by hand. It starts random and gets updated continuously during training as the model learns to predict the next token across trillions of examples. By the end of training, the embedding for a given token implicitly encodes everything the model observed that token doing across the corpus.

This raises a real question: does the embedding for a token like deploy "know" it is spelled d-e-p-l-o-y? Formally, no. The token is just an integer. But research shows character-level information does bleed into embeddings. The most likely reason: the same root word surfaces in many tokenized forms across a large corpus (singular, plural, past tense, mid-sentence, sentence-initial, after punctuation, inside a code block) and the model is incentivized to build representations that generalize across those surface variations.

Practical implication: models handle spelling and character-level tasks better for common words (which appear in many tokenized surface forms during training) and worse for rare words or niche proper nouns that were almost always tokenized the same way. If you are building a prompt that requires character-level precision, explicitly spell out the string in your prompt rather than assuming the model can decompose it cleanly.

Note for API users

Tokenizers are versioned separately from models. cl100k_base (GPT-4) and Claude's tokenizer have different vocabularies, different merge rules, and will produce different token counts for the same input. Never estimate token count by dividing word count by 0.75 for anything where cost or context window headroom matters. Run the actual tokenizer for the model you are calling.

  • Vocab size in modern LLMs: ~100K
  • Avg chars per English token: 4-5
  • Token inflation for non-Latin scripts: 2-5x
  • Algorithm behind GPT-4 and Claude: BPE

Is AI actually safe?

Short answer: for the models you use every day, the sci-fi robot takeover is not the threat. The real risks are more boring and more immediate: disinformation, autonomous weapons, surveillance, and bias baked into hiring or sentencing systems. Safety is a real engineering discipline and there are multiple layers of it in every serious AI product. Here is what actually exists.

Before a model ever reaches you, it goes through a process called alignment training. This is where the model is trained not just on what is factually correct but on what is helpful, harmless, and honest.

The most common technique is called RLHF (Reinforcement Learning from Human Feedback). Human reviewers rate thousands of model responses. The model is then updated to produce more of the good responses and fewer of the bad ones. Over many rounds, this shapes the model's behavior before anyone outside the lab sees it.

  • The model learns to decline requests that could cause harm
  • It learns to flag uncertainty instead of making things up confidently
  • It learns to avoid helping with things that are clearly illegal or dangerous
  • It learns tone, nuance, and when to push back on a bad idea

This is not perfect. Models can still be pushed in wrong directions with the right prompts. But the baseline behavior is trained, not bolted on as an afterthought.

Every serious AI product runs on top of a system prompt. Before the model sees your message, the company or developer who built the product has already given the model instructions: what it can talk about, what it should refuse, what persona it should use, and what safety rules to follow.

Think of it as the employer giving new rules to a contractor. The model follows them. So even if the underlying model could technically answer something, the system prompt can block it entirely.

On top of that, most platforms run content filters that check both input and output:

  • Input filters catch known harmful patterns before they reach the model
  • Output filters check the model's response before it reaches you and block or rewrite anything that crosses a line
  • Rate limits and monitoring flag unusual usage patterns and slow down or stop automated abuse
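The input and output filters wrap around the model call like this. The blocked phrase, the card-number pattern, and the echo "model" are all placeholders; real platforms typically use trained classifiers rather than simple text matching.

```python
import re

# Placeholder rules for illustration only.
BLOCKED_INPUT_PHRASES = ["steal a password"]
CARD_NUMBER = re.compile(r"\b\d{16}\b")

def fake_model(prompt: str) -> str:
    """Stand-in for the real model call."""
    return f"Echo: {prompt}"

def guarded_reply(user_message: str) -> str:
    # Input filter: stop known-bad requests before the model sees them.
    lowered = user_message.lower()
    if any(phrase in lowered for phrase in BLOCKED_INPUT_PHRASES):
        return "Sorry, I can't help with that."
    draft = fake_model(user_message)
    # Output filter: rewrite anything that should not leave the system.
    return CARD_NUMBER.sub("[redacted]", draft)

print(guarded_reply("My card is 1234567812345678"))
print(guarded_reply("How do I steal a password?"))
```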

The model by itself cannot do anything. It reads text and writes text. That is the whole thing. It cannot reach out to the internet, execute code, send emails, move money, or control systems unless a tool is explicitly connected and the system running it allows it.

When a model does have tools, those tools are scoped. A customer support agent with access to your order data does not also have access to your bank account. The access is defined by whoever built the product, not decided by the model.

This is important because it means the risk surface is well-defined. An agent that can only read your calendar and draft meeting summaries is limited to exactly those two things. The worry about "the AI doing something unexpected" is really a question about whoever wired it up and what permissions they granted.

Before any major model ships, the company runs red team exercises. Red teaming is where a group of people try as hard as they can to break the model: get it to say something harmful, bypass its safety rules, produce dangerous content, or behave in an unexpected way.

What they find gets fixed before release. What slips through gets patched after. This is not so different from how security teams test software. You ship, you monitor, you patch, you repeat.

Anthropic (which makes Claude), OpenAI (ChatGPT), Google (Gemini), and Meta (Llama) all publish safety reports and run external third-party audits. This does not mean the models are perfect. It means there is a process, it is improving, and problems are documented publicly.

The doom question

Every few months someone writes a piece about AI ending humanity. Here is a more grounded take on what the actual risk picture looks like today.

🚫 Overhyped / not how it works

  • A model spontaneously "waking up" with goals
  • AI deciding on its own to take over systems
  • Terminator-style self-preservation instincts
  • Models conspiring without human instruction

🔴 Already happening right now

  • AI-generated disinformation at election scale
  • Autonomous targeting in active warzones
  • Deepfakes used for fraud and harassment
  • Biased models used in hiring and sentencing

⚠️ Real and getting harder to ignore

  • Multi-agent systems with minimal human oversight
  • Concentration of capability in 3-4 companies
  • Economic disruption faster than retraining allows
  • Alignment gaps at much larger model scales

✅ Already working reasonably well

  • Refusing clearly harmful requests
  • Transparency about being an AI
  • Scoped tool access in production systems
  • Public safety research and third-party audits

The practical reality: The models you use every day are tools. Very capable, sometimes surprising tools, but tools. They do not have goals. They do not want anything. ChatGPT and Claude both have persistent memory now by default, but that memory is yours to manage and delete. The risks worth taking seriously are not robots going rogue. They are the same risks you take seriously with any powerful technology: who controls it, what it has access to, how it is audited, and whether the people building it are honest about what it can and cannot do. The warfare and surveillance uses are the part that should genuinely concern you. The doom headlines about sentient AI are mostly a distraction from that.

That is the simple version of how Claude and ChatGPT work. Not magic. A stack of models, data, tools, and guardrails.

Behind one clean reply, the system may be predicting text, searching for context, calling tools, checking results, and managing cost and safety. Once you can see the pieces, the black box starts to look a lot more understandable.