How AI Actually Works: Claude, ChatGPT and LLMs Explained Simply

This article started at a barbershop. Small town, Midwest, the kind of place where the conversation covers everything from local news to the end of the world. Someone brought up AI and the room had questions: robots taking over, bots walking the streets, the whole picture painted by whatever came across their feed that week. I pulled out my phone, showed them what I was actually doing with it, and tried to explain how it works in plain English. This is that explanation, written down.

Here is the simple version first. When you ask Claude or ChatGPT something, the system reads your question, breaks it into smaller pieces, checks patterns it learned during training, may pull in outside information or tools, and then writes the answer back one small piece at a time. Modern models also handle images, audio, and video, not just text.

A few quick definitions before we start. LLM stands for large language model. It is software trained on a huge amount of text, images, and other data so it can predict what comes next. A token is a small piece of text. A tool is an extra helper like web search, a calculator, a code runner, or a camera feed.

Read the sections below if you want the step-by-step version. If you are a barber in a small Midwest town or someone who just wants to understand what is actually going on, this is written for you.

The basics: how an LLM writes text

01 - TRANSFORMER ARCHITECTURE

Tokens and embeddings

Your text gets split into tokens. A token is a small piece of text, not always a full word. "Hello world" might become ["Hello", " world"]. Longer or unusual words often get broken into smaller parts.

Each token then gets turned into numbers. In AI, those numbers are often called a vector or an embedding. Think of that as the model's internal way of storing meaning so related words and ideas end up closer together.
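
To make "closer together" concrete, here is a minimal sketch. The three-number vectors are invented for illustration (real models learn vectors with thousands of numbers during training), and cosine similarity is one common way to measure closeness, not necessarily the only one a given system uses.

```python
import math

# Hypothetical 3-number embeddings, invented for illustration only.
# Real models learn much longer vectors during training.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Return how closely two vectors point in the same direction (-1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(TOY_EMBEDDINGS["cat"], TOY_EMBEDDINGS["dog"])
cat_car = cosine_similarity(TOY_EMBEDDINGS["cat"], TOY_EMBEDDINGS["car"])
print(f"cat vs dog: {cat_dog:.2f}")
print(f"cat vs car: {cat_car:.2f}")
```

With these made-up numbers, "cat" scores closer to "dog" than to "car", which is the kind of relationship a trained model ends up encoding on its own.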

02 - TRANSFORMER ARCHITECTURE

Attention: how the model keeps track of meaning

Attention is the part that helps the model decide what matters most in a sentence. Each token checks the other tokens around it to figure out which words are important for meaning.

When the model reads "The cat sat on the mat", the word "sat" is linked closely to "cat" and "mat" because those words help explain who did the action and where it happened.

Large models do this many times in parallel using attention heads. An attention head is just one small pattern detector focused on a certain kind of relationship:

→One head may track who did the action
→Another may track what the action applies to
→Another may connect names, pronouns, and references
→Many others learn patterns around order, grammar, tone, and structure
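
The idea of one token "checking" the others can be sketched with the standard scaled dot-product attention formula. This is a toy with a single head and invented two-number vectors; the `keys`, `values`, and `query_for_sat` numbers are assumptions chosen so the example is easy to follow, not values from any real model.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Blend the value vectors, weighted by how relevant each token looked.
    dims = len(values[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dims)]
    return weights, mixed

tokens = ["The", "cat", "sat"]
# Hypothetical 2-number vectors; a real model learns these.
keys = [[0.1, 0.0], [1.0, 0.2], [0.3, 1.0]]
values = keys  # reuse keys as values to keep the toy small
query_for_sat = [1.0, 0.5]

weights, mixed = attend(query_for_sat, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
```

Each attention head in a real model runs this same calculation with its own learned vectors, which is how different heads end up tracking different relationships.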

03 - TRANSFORMER ARCHITECTURE

How the answer is written one token at a time

The last step is prediction. The model scores every possible next token, and those scores are converted into probabilities, which just means how likely each option is. Then it picks one.

After it picks the next token, that token gets added to the running answer. Then the model does the same thing again. That loop is why replies appear word by word even though a lot of math is happening underneath.
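
That loop can be sketched in a few lines. The score table below is invented and only looks at the previous token, whereas a real model computes scores from the whole context; the sketch also always takes the single most likely token (greedy decoding), while real products often sample with some randomness.

```python
import math

# Hypothetical scores for the next token given the previous one.
# A real model computes these from the whole context using its weights.
TOY_SCORES = {
    "<start>": {"The": 3.0, "A": 1.0},
    "The": {"cat": 2.5, "mat": 1.0},
    "A": {"cat": 2.0},
    "cat": {"sat": 3.0, "<end>": 0.5},
    "mat": {"<end>": 2.0},
    "sat": {"<end>": 2.0, "down": 1.0},
    "down": {"<end>": 3.0},
}

def to_probabilities(scores):
    """Softmax: convert raw scores into probabilities that sum to 1."""
    peak = max(scores.values())
    exps = {token: math.exp(score - peak) for token, score in scores.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}

def generate(max_tokens=10):
    """Greedy decoding: append the most likely next token until <end>."""
    output = []
    previous = "<start>"
    for _ in range(max_tokens):
        probabilities = to_probabilities(TOY_SCORES[previous])
        next_token = max(probabilities, key=probabilities.get)
        if next_token == "<end>":
            break
        output.append(next_token)
        previous = next_token
    return output

print(" ".join(generate()))
```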

→Transformer layers: 32-80
→Embedding dimensions: 4096+
→Training tokens: trillions
→Training cost: $10M+

Tool calling: how Claude or ChatGPT reaches outside the model

The model by itself does not magically know live prices, your private documents, or what is inside a spreadsheet. For that it uses tools. A tool is a connected helper like web search, a calculator, a code runner, or a company database. You may also hear this called function calling. An API is simply a formal way for one software system to ask another software system for data or an action.
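
Here is a rough sketch of the application side of tool calling. The JSON shape, the `calculator` tool, and the `handle_model_output` function are all hypothetical; each vendor defines its own request format, so treat this as the general idea rather than any real API.

```python
import json

def calculator(expression: str) -> str:
    """Hypothetical tool: add numbers written like '120+30'."""
    return str(sum(float(part) for part in expression.split("+")))

# Registry of tools the application chooses to expose to the model.
TOOLS = {"calculator": calculator}

def handle_model_output(model_output: str) -> str:
    """Run the requested tool if the model asked for one, else pass text through."""
    try:
        request = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output  # plain text answer, no tool needed
    if not isinstance(request, dict) or "tool" not in request:
        return model_output
    tool = TOOLS.get(request["tool"])
    if tool is None:
        return f"error: unknown tool {request['tool']!r}"
    return tool(**request.get("arguments", {}))

# Stand-in for what a model might emit when it decides to use a tool.
fake_model_output = '{"tool": "calculator", "arguments": {"expression": "120+30"}}'
print(handle_model_output(fake_model_output))
```

The key point is that the model only asks. The surrounding software decides whether the tool exists, runs it, and hands the result back.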

api.anthropic.com/v1/messages

ChatGPT's Tools

Web browsing (Bing API)
Python execution (Jupyter)
DALL-E image generation
File upload/download
Data analysis (pandas, matplotlib)

Claude's Tools

Web search and fetch
Bash execution (Linux container)
File operations (read/write/edit)
Image analysis (native)
PDF/document parsing

Search and memory: how RAG finds the right information

When Claude or ChatGPT needs facts from documents, websites, or a company knowledge base, it usually does not rely on memory alone. It searches for relevant text first. One common setup is called RAG, short for Retrieval Augmented Generation. In plain English, that means find useful source material first, then write the answer using that material.

RAG PIPELINE

How vector search works in plain English

Ingestion: Documents → Chunks → Embeddings

First, documents are broken into smaller chunks, which means short passages instead of one huge file. Each chunk is turned into an embedding, which is a number-based fingerprint of meaning, and stored in a special search system called a vector database.

Query: Embed the Question

Your question is turned into the same kind of embedding so the system can compare meaning, not just exact keywords.

Vector Search: Find Similar Chunks

The vector database looks for chunks that mean something similar to your question. This is why it can find useful passages even when your wording is different from the document.

Re-ranking: Precision Scoring

A second pass may re-rank the top matches. Re-ranking just means sorting the best candidates more carefully so the final answer uses the strongest sources.

Generation: Answer with Grounding

The best chunks are added to the model's context window, which is the text the model can currently see. Then the model writes an answer based on those sources. Good systems also attach citations or admit when the sources do not contain the answer.
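
The pipeline above can be sketched end to end in plain Python. One loud assumption: the `embed` function here just counts words, so it matches shared keywords rather than meaning. Real systems use a learned embedding model and a vector database. The policy text is invented, and re-ranking is skipped to keep the toy short.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts. Real systems use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Similarity between two word-count vectors (0 means nothing in common)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def split_into_chunks(document):
    """Ingestion step: one chunk per sentence."""
    return re.split(r"(?<=\.)\s+", document.strip())

def retrieve(question, index, top_k=1):
    """Query and search steps: embed the question, return the closest chunks."""
    question_vector = embed(question)
    ranked = sorted(index, key=lambda item: cosine(question_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

def build_prompt(question, sources):
    """Generation step: put the retrieved chunks into the model's context."""
    context = "\n".join(f"- {source}" for source in sources)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"

# Hypothetical policy document, invented for this example.
document = (
    "Refunds are accepted within 30 days of delivery. "
    "Damaged items qualify for a full refund. "
    "Shipping takes 3 to 5 business days. "
    "Express shipping is available for an extra fee."
)
index = [(text, embed(text)) for text in split_into_chunks(document)]
question = "How many days does shipping take?"
top_chunks = retrieve(question, index, top_k=1)
print(build_prompt(question, top_chunks))
```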

→Embedding dimensions: 1536
→Vector search time: <50ms
→Documents indexed: 100M+
→Chunks retrieved: 5-10

Code execution: how the AI can test things safely

Some AI products can run code, inspect files, or make charts. They do this inside a sandbox. A sandbox is an isolated workspace that keeps the run separate from the rest of the system so mistakes or risky code do less damage.
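
As a very rough sketch of the idea, the snippet below runs code in a separate Python process with a time limit and a throwaway working directory. This is not real isolation: the child process can still reach the rest of the machine, which is why production systems use containers or virtual machines. The `run_untrusted` name and its return shape are made up for this example.

```python
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_seconds: float = 5.0) -> dict:
    """Run code in a separate Python process inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            result = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: Python's isolated mode
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
            )
        except subprocess.TimeoutExpired:
            # The child is killed when the time limit is hit.
            return {"ok": False, "stdout": "", "error": "timed out"}
    return {
        "ok": result.returncode == 0,
        "stdout": result.stdout,
        "error": result.stderr,
    }

print(run_untrusted("print(2 + 2)"))
print(run_untrusted("while True: pass", timeout_seconds=1)["error"])
```

The time limit plays the same role as the execution timeouts listed for the product sandboxes: runaway code gets stopped instead of hanging the whole system.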

ChatGPT Sandbox

Jupyter Python kernel
Docker isolated container
No internet access
10GB disk space
Pre-installed packages (pandas, numpy, matplotlib)
30-120s execution timeout

Claude Sandbox

Ubuntu Linux container
Full bash shell access
Limited internet (approved domains)
Package installation (pip, npm, apt)
File system read/write
30-120s execution timeout

Watch AI Build a Simple App From Scratch

When you ask AI to build software, it does not just guess. It considers options, checks its tools, thinks through the logic step-by-step, and then brings the app to life.

Infrastructure: the systems behind the answer

All of this sits on backend infrastructure, which is just the hidden machinery behind the app. That includes GPUs for the heavy math, search systems for finding context, sandboxes for running code, and monitoring tools that watch speed, failures, and cost. You may hear the word inference here. Inference simply means the model generating an answer. Latency just means how long you wait for that answer.

→GPU clusters: A100/H100
→End-to-end latency: 3-8s
→Cost per 1000 requests: $5-50
→Cache hit rate: 40-60%

COST BREAKDOWN

What Each Request Actually Costs

Model Inference: $0.01-0.10 per request

This range applies to GPT-4 and Claude Opus tier models and depends on input/output length. Smaller models (GPT-3.5, Claude Haiku) cost $0.001-0.01.

Web Search: $0.001-0.005 per call

API contracts with Bing, Google, Brave. Caching reduces cost for frequent queries.

Code Execution: $0.0001 per run

Amortized sandbox cost. Containers spin up on demand, share resources across users.

RAG Retrieval: $0.0001-0.001 per query

Embedding generation + vector database query. Scales with corpus size.

Total cost per 1000 requests: $5-50 depending on model choice, tools used, and complexity. That math is the logic behind ChatGPT Plus ($20/month) and Claude Pro ($20/month): subscriptions assume 500-2000 requests/month. At those rates a heavy user can cost more than the subscription brings in, and light users subsidize them.
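
The subscription math is easy to check using only the ranges quoted above. This sketch assumes cost scales linearly with request count, which is a simplification.

```python
def monthly_serving_cost(requests_per_month, cost_per_1000_requests):
    """Dollars spent serving one user for a month."""
    return requests_per_month * cost_per_1000_requests / 1000

SUBSCRIPTION_PRICE = 20.0  # dollars per month, from the text

for requests in (500, 2000):
    for cost_per_1000 in (5, 50):
        cost = monthly_serving_cost(requests, cost_per_1000)
        verdict = "loses money" if cost > SUBSCRIPTION_PRICE else "covers its cost"
        print(f"{requests} requests at ${cost_per_1000}/1000: ${cost:.2f} ({verdict})")
```

Across those ranges the monthly cost of one user runs from $2.50 to $100, so the same $20 plan is comfortably profitable at one end and a loss at the other.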

AI agents: how the system handles multi-step work

Sometimes you do not want one answer. You want the system to handle a job with several steps, like checking an order, applying a policy, sending an email, and updating a record. That is where AI agents come in. An agent is simply a model plus tools plus a step-by-step loop for deciding what to do next.

AGENT WORKFLOW

The Three Components: Agent, Tools, Workflow

🧠 Agent (the coordinator)

This is the part that reads the goal, picks the next step, and decides when the job is done. Think of it like a coordinator, not a magic brain.

🔧 Tools (the helpers)

These are the connected helpers the agent can use: search a knowledge base, check a database, send an email, update a system, or write a file. Each tool has rules about what input it expects and what it returns.

🔄 Workflow (the step-by-step loop)

The workflow is the repeatable loop. Read the goal. Pick a step. Use a tool. Look at the result. Decide the next step. Stop when the task is done or hand it to a person if it gets stuck.

agent-platform.company.com/workflows/customer-refund

GOAL

"Customer claims their order #8472 arrived damaged. Process refund and send confirmation email."

→lookup_order(order_id: str): Queries order database. Returns order details, status, items, customer info.
→check_refund_policy(reason: str, order_date: str): Checks if refund is allowed per company policy. Returns eligibility and conditions.
→process_refund(order_id: str, amount: float, reason: str): Initiates refund via payment processor. Returns transaction ID and estimated completion.
→send_email(to: str, template: str, variables: dict): Sends templated email to customer. Returns delivery status.
→update_crm(customer_id: str, notes: str, tags: list): Updates customer record in CRM. Adds notes and tags for future reference.

Agent receives goal and available tools

The model sees the goal, the tool descriptions, and any earlier conversation.

↓

Agent plans the next action

Example thought: "I need order details first"
Action: call lookup_order(order_id="8472")

↓

Tool runs and returns a result

Observation: Order found, total $89.99, status "delivered", customer email: [email protected]

↓

Agent updates plan and continues

It loops back with the new result, then keeps going until the task is done or it needs help.

Error handling:

→Tool fails: Agent sees error message, tries alternative approach
→Ambiguous result: Agent asks clarifying question or makes best-effort decision
→Max iterations hit: Agent summarizes progress, flags for human review
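
The loop and its stopping rules can be sketched as follows. Everything here is a stand-in: the tools return canned data, `scripted_planner` is a hard-coded substitute for the model's decision making, and the `order_date` value is invented because the example above does not give one.

```python
# Stubbed tools: each returns canned data instead of touching real systems.
def lookup_order(order_id):
    return {"order_id": order_id, "total": 89.99, "status": "delivered"}

def check_refund_policy(reason, order_date):
    return {"eligible": reason == "damaged"}

def process_refund(order_id, amount, reason):
    return {"transaction_id": "demo-txn-001", "amount": amount}

TOOLS = {
    "lookup_order": lookup_order,
    "check_refund_policy": check_refund_policy,
    "process_refund": process_refund,
}

def scripted_planner(history):
    """Stand-in for the model: choose the next action from results so far."""
    results = {step["tool"]: step["result"] for step in history}
    if "lookup_order" not in results:
        return {"tool": "lookup_order", "args": {"order_id": "8472"}}
    if "check_refund_policy" not in results:
        # order_date is a made-up value for this sketch.
        return {"tool": "check_refund_policy",
                "args": {"reason": "damaged", "order_date": "2024-01-01"}}
    if "process_refund" not in results and results["check_refund_policy"]["eligible"]:
        order = results["lookup_order"]
        return {"tool": "process_refund",
                "args": {"order_id": order["order_id"],
                         "amount": order["total"], "reason": "damaged"}}
    return {"tool": None, "args": {}}  # nothing left to do

def run_agent(planner, tools, max_steps=10):
    """Plan, act, observe, repeat. Stop when done or when steps run out."""
    history = []
    for _ in range(max_steps):
        action = planner(history)
        if action["tool"] is None:
            return {"status": "done", "history": history}
        try:
            result = tools[action["tool"]](**action["args"])
        except Exception as error:  # tool failed: record it for the planner
            result = {"error": str(error)}
        history.append({"tool": action["tool"], "result": result})
    # Ran out of steps: hand off to a person instead of looping forever.
    return {"status": "needs_human_review", "history": history}

outcome = run_agent(scripted_planner, TOOLS)
print(outcome["status"], [step["tool"] for step in outcome["history"]])
```

In a real agent the planner is a model call that sees the goal, the tool descriptions, and the history, and it would also need to react to error results, which this scripted version does not.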

PRODUCTION PATTERNS

How companies put agents into real products

How the system keeps its place

These systems need to remember where they were. Teams save the current state, which means the conversation, tool results, and next step, in a database. If the process stops or times out, it can restart from the last saved point instead of beginning again.
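
A minimal sketch of that save-and-resume pattern is below. It uses a JSON file as a stand-in for the database, and the state fields (`step`, `tool_results`) and file name are assumptions made for illustration.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state to a temp file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False, suffix=".tmp"
    ) as handle:
        json.dump(state, handle)
        temp_path = handle.name
    os.replace(temp_path, path)  # a crash never leaves a half-written file

def load_checkpoint(path):
    """Return the last saved state, or a fresh one if nothing was saved."""
    if not os.path.exists(path):
        return {"step": 0, "tool_results": []}
    with open(path) as handle:
        return json.load(handle)

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "refund-8472.json")
    state = load_checkpoint(path)          # first run: starts at step 0
    state["tool_results"].append({"tool": "lookup_order", "status": "delivered"})
    state["step"] += 1
    save_checkpoint(path, state)
    resumed = load_checkpoint(path)        # simulated restart after a crash

print(resumed["step"], resumed["tool_results"])
```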

Where a person still has to approve

For important actions, a human still needs to sign off. The agent prepares the action, pauses, and waits for approval. This is common for money movement, risky customer messages, deletions, or policy exceptions.

Safety rules around tools

Not every tool is safe to let loose automatically. Real systems put clear limits around what the agent can do:

→Rate limits per tool (max 10 DB queries per workflow)
→Permission checks (agent can read orders, not delete them)
→Cost limits (max $5 in API calls per workflow)
→Audit logs (every tool call logged with timestamp, inputs, outputs)
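
Three of those limits (permissions, per-tool rate limits, and audit logging) can be sketched as a thin wrapper around the tools. The `GuardedTools` class and the two order tools are hypothetical, and cost limits are left out to keep the example short.

```python
import time

class GuardedTools:
    """Wrap tools with a permission list, a per-tool call cap, and an audit log."""

    def __init__(self, tools, allowed, max_calls_per_tool=10):
        self.tools = tools
        self.allowed = set(allowed)
        self.max_calls_per_tool = max_calls_per_tool
        self.call_counts = {}
        self.audit_log = []

    def call(self, name, **inputs):
        used = self.call_counts.get(name, 0)
        if name not in self.allowed:
            output = {"error": "permission denied"}
        elif used >= self.max_calls_per_tool:
            output = {"error": "rate limit reached"}
        else:
            self.call_counts[name] = used + 1
            output = self.tools[name](**inputs)
        # Every attempt is logged, including the ones that were refused.
        self.audit_log.append(
            {"timestamp": time.time(), "tool": name, "inputs": inputs, "output": output}
        )
        return output

def read_order(order_id):
    return {"order_id": order_id, "status": "delivered"}

def delete_order(order_id):
    return {"deleted": order_id}

guarded = GuardedTools(
    tools={"read_order": read_order, "delete_order": delete_order},
    allowed=["read_order"],
    max_calls_per_tool=2,
)
first = guarded.call("read_order", order_id="8472")
second = guarded.call("read_order", order_id="8472")
third = guarded.call("read_order", order_id="8472")      # over the cap
blocked = guarded.call("delete_order", order_id="8472")  # not permitted
print(third, blocked, len(guarded.audit_log))
```

Because the checks live in the wrapper rather than in the model, the agent cannot talk its way past them.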

How teams watch what the agent is doing

If you cannot see what the agent is doing, you cannot trust it. Teams track simple things like success rate, number of steps, common failures, and cost:

→Success rate: goal completion %
→Avg steps: tool calls per workflow
→Failure modes: where agents get stuck
→Cost per run: LLM + tools combined

Tracing tools follow each workflow from start to finish. Dashboards show slow steps, repeated loops, and failures so teams can fix the system before it causes trouble.

Common software used to build these systems

LangGraph (LangChain)

State machines for agent workflows
Built-in persistence and checkpointing
Python-native, integrates with LangChain tools
Good for complex multi-agent systems

CrewAI

Multiple agents working together
Role-based agents (researcher, writer, reviewer)
Sequential and hierarchical workflows
Good for content generation pipelines

AutoGen (Microsoft)

Conversational agent framework
Agents that talk to each other
Built-in code execution and debugging
Good for research and analysis tasks

Custom (Roll Your Own)

Direct LLM API + tool calling
Full control over how the workflow runs
Minimal dependencies, max flexibility
Good for production at scale

Going deeper: how tokenization actually works

The intro above uses "tokens" as shorthand. This section is for people who have already built on top of an LLM API or are writing prompts professionally and want to understand what is actually happening at that layer. The design decisions made here have real consequences for model behavior, cost, and context window math.

The model never sees words. It sees integers. Every piece of input (your query, the system prompt, retrieved context, conversation history) gets converted to a flat sequence of token IDs before a single matrix multiply happens. How that conversion works matters more than most people realize.

TOKENIZATION 01 - VOCABULARY DESIGN

Three ways to slice text: characters, words, and subwords

A vocabulary is the complete set of units a model can recognize and produce. How you define those units involves a hard tradeoff with no perfect answer.

Word-level

Split on whitespace and punctuation. Intuitive, but vocabulary bloat kills you fast. English surface forms explode once you account for plurals, conjugations, compounds, proper nouns, typos, and domain jargon. Anything outside the fixed vocabulary at inference time becomes an unknown token, breaking the prediction chain. Pre-2018 NLP systems lived and died by this problem.

↓

Character-level

Every Unicode codepoint is its own token. Vocabulary stays tiny, you never hit an unknown token, and you can handle any script. The cost: sequence length explodes. English words average around 4-5 characters plus spaces, so a 512-token word-level context becomes roughly 2,500-3,000 tokens at the character level for the same passage. Learning long-range dependencies across that longer sequence is significantly harder.

↓

Subword (what everything uses today)

High-frequency strings get their own token. Low-frequency strings get decomposed into smaller pieces that individually appear more often. You get vocabulary coverage close to character models without the sequence length penalty. GPT-4, Claude, Llama, Gemini, and every other production LLM you interact with uses a subword scheme.

TOKENIZATION 02 - ALGORITHM

Byte Pair Encoding: building a vocabulary from frequency

BPE is a bottom-up compression algorithm adapted for vocabulary construction. The loop is simple in principle:

Initialize with individual characters

Start with a vocabulary of every character or byte present in the training corpus. At this point the scheme is functionally character-level tokenization.

Count every adjacent pair across the corpus

Scan the entire training dataset and count how often every pair of adjacent tokens appears together. "i" + "n" might co-occur millions of times. "q" + "x" almost never.

Merge the most frequent pair into a new token

Add the merged string as a new vocabulary entry. Replace every occurrence of that pair in the corpus with the new token. Vocabulary size increases by one. Corpus token count drops by the number of times that pair occurred.

Repeat until you hit your vocabulary budget

GPT-2 stopped at 50,257 tokens. GPT-4's tokenizer has roughly 100,000 entries, and other current models are in that range or larger. Larger vocabularies reduce sequence length and can improve fluency on common text, but they also increase the embedding table size and the softmax computation at the output layer. It is a hyperparameter tuned against downstream benchmark performance.

The result: common strings like function or return get a single token. Rare words get decomposed into pieces, often slicing right through what a linguist would call morpheme boundaries. That is a direct consequence of the design: BPE optimizes for corpus frequency, not linguistic structure, and those two things do not always align.
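
The merge loop described above fits in a short script. This is a toy trainer over a four-word corpus with invented frequencies; production tokenizers work on bytes, handle whitespace and special tokens, and run tens of thousands of merges.

```python
from collections import Counter

def count_pairs(words):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_counts, num_merges):
    """Start from characters and apply the most frequent merge repeatedly."""
    words = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Hypothetical corpus: word -> how often it appears.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
tokens_before = sum(len(word) * freq for word, freq in word_counts.items())
merges, words = train_bpe(word_counts, num_merges=2)
tokens_after = sum(len(symbols) * freq for symbols, freq in words.items())

print(merges)
print(tokens_before, "->", tokens_after)
```

After two merges the shared ending "est" has become a single token, and the corpus token count has dropped by exactly the number of pair occurrences that were merged.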

TOKENIZATION 03 - EDGE CASES

Where tokenization creates surprising model behavior

If you have spent time prompting models at scale, you have probably run into behaviors that trace back to the tokenization layer rather than the model weights.

Character counting failures

Ask a model to count the letter "s" in "Mississippi" and it often gets it wrong. The model does not operate on characters directly. If "iss" merges into a single token, the individual letters inside it are not immediately recoverable from the model's internal representation. It has to reconstruct character-level information from the embedding, which is lossy.

Whitespace-sensitive tokenization

The tokens for "python" and " python" (with a leading space) usually have different IDs and different embeddings. Stripping whitespace from prompts can therefore produce subtly different outputs. This matters if you are doing token-level log probability analysis or building evals that compare raw completions.

Rhyme and phonetics

Models struggle with rhyming more than their language fluency suggests they should. Two words can rhyme while sharing almost no subword tokens. The model has to infer phonetic similarity from spelling patterns, which are encoded unevenly across the vocabulary depending on how each word was tokenized.

Multilingual token inflation

A vocabulary built on an English-heavy corpus tokenizes English efficiently and everything else less so. A 100-token English passage might cost 200-500 tokens in a non-Latin script or a morphologically rich language with less training representation. Your effective context window shrinks for that input, and you burn more tokens per API call for the same semantic content.

TOKENIZATION 04 - EMBEDDINGS AND PRACTICAL NOTES

What the model "knows" about its own tokens

Each token ID maps to a learned embedding vector. That vector is not engineered by hand. It starts random and gets updated continuously during training as the model learns to predict the next token across trillions of examples. By the end of training, the embedding for a given token implicitly encodes everything the model observed that token doing across the corpus.

This raises a real question: does the embedding for a token like deploy "know" it is spelled d-e-p-l-o-y? Formally, no. The token is just an integer. But research shows character-level information does bleed into embeddings. The most likely reason: the same root word surfaces in many tokenized forms across a large corpus (singular, plural, past tense, mid-sentence, sentence-initial, after punctuation, inside a code block) and the model is incentivized to build representations that generalize across those surface variations.

Practical implication: models handle spelling and character-level tasks better for common words (which appear in many tokenized surface forms during training) and worse for rare words or niche proper nouns that were almost always tokenized the same way. If you are building a prompt that requires character-level precision, explicitly spell out the string in your prompt rather than assuming the model can decompose it cleanly.

Note for API users

Tokenizers are versioned separately from models. cl100k_base (GPT-4) and Claude's tokenizer have different vocabularies, different merge rules, and will produce different token counts for the same input. Never estimate token count by dividing word count by 0.75 for anything where cost or context window headroom matters. Run the actual tokenizer for the model you are calling.

→Vocab size in modern LLMs: ~100K
→Avg chars per English token: 4-5
→Token inflation for non-Latin scripts: 2-5x
→Algorithm behind GPT-4 and Claude: BPE

Is AI actually safe?

Short answer: for the models you use every day, the sci-fi robot takeover is not the threat. The real risks are more boring and more immediate: disinformation, autonomous weapons, surveillance, and bias baked into hiring or sentencing systems. Safety is a real engineering discipline and there are multiple layers of it in every serious AI product. Here is what actually exists.

01 - SAFETY LAYERS

Training alignment: teaching the model what not to do

Before a model ever reaches you, it goes through a process called alignment training. This is where the model is trained not just on what is factually correct but on what is helpful, harmless, and honest.

The most common technique is called RLHF (Reinforcement Learning from Human Feedback). Human reviewers rate thousands of model responses. The model is then updated to produce more of the good responses and fewer of the bad ones. Over many rounds, this shapes the model's behavior before anyone outside the lab sees it.

→The model learns to decline requests that could cause harm
→It learns to flag uncertainty instead of making things up confidently
→It learns to avoid helping with things that are clearly illegal or dangerous
→It learns tone, nuance, and when to push back on a bad idea

This is not perfect. Models can still be pushed in wrong directions with the right prompts. But the baseline behavior is trained, not bolted on as an afterthought.

02 - SAFETY LAYERS

System prompts and operator guardrails

Every serious AI product runs on top of a system prompt. Before the model sees your message, the company or developer who built the product has already given the model instructions: what it can talk about, what it should refuse, what persona it should use, and what safety rules to follow.

Think of it as the employer giving new rules to a contractor. The model follows them. So even if the underlying model could technically answer something, the system prompt can block it entirely.

On top of that, most platforms run content filters that check both input and output:

→Input filters catch known harmful patterns before they reach the model
→Output filters check the model's response before it reaches you and block or rewrite anything that crosses a line
→Rate limits and monitoring flag unusual usage patterns and slow down or stop automated abuse

03 - SAFETY LAYERS

What the model can and cannot do on its own

The model by itself cannot do anything. It reads text and writes text. That is the whole thing. It cannot reach out to the internet, execute code, send emails, move money, or control systems unless a tool is explicitly connected and the system running it allows it.

When a model does have tools, those tools are scoped. A customer support agent with access to your order data does not also have access to your bank account. The access is defined by whoever built the product, not decided by the model.

This is important because it means the risk surface is well-defined. An agent that can only read your calendar and draft meeting summaries is limited to exactly those two things. The worry about "the AI doing something unexpected" is really a question about whoever wired it up and what permissions they granted.

04 - SAFETY LAYERS

Red teaming and ongoing testing

Before any major model ships, the company runs red team exercises. Red teaming is where a group of people try as hard as they can to break the model: get it to say something harmful, bypass its safety rules, produce dangerous content, or behave in an unexpected way.

What they find gets fixed before release. What slips through gets patched after. This is not so different from how security teams test software. You ship, you monitor, you patch, you repeat.

Anthropic (who makes Claude), OpenAI (ChatGPT), Google (Gemini), and Meta (Llama) all publish safety reports and run external third-party audits. This does not mean the models are perfect. It means there is a process, it is improving, and problems are documented publicly.

The doom question

Every few months someone writes a piece about AI ending humanity. Here is a more grounded take on what the actual risk picture looks like today.

🚫 Overhyped / not how it works

A model spontaneously "waking up" with goals
AI deciding on its own to take over systems
Terminator-style self-preservation instincts
Models conspiring without human instruction

🔴 Already happening right now

AI-generated disinformation at election scale
Autonomous targeting in active warzones
Deepfakes used for fraud and harassment
Biased models used in hiring and sentencing

⚠️ Real and getting harder to ignore

Multi-agent systems with minimal human oversight
Concentration of capability in 3-4 companies
Economic disruption faster than retraining allows
Alignment gaps at much larger model scales

✅ Already working reasonably well

Refusing clearly harmful requests
Transparency about being an AI
Scoped tool access in production systems
Public safety research and third-party audits

The practical reality: The models you use every day are tools. Very capable, sometimes surprising tools, but tools. They do not have goals. They do not want anything. ChatGPT and Claude both offer persistent memory features now, but that memory is yours to manage and delete. The risks worth taking seriously are not robots going rogue. They are the same risks you take seriously with any powerful technology: who controls it, what it has access to, how it is audited, and whether the people building it are honest about what it can and cannot do. The warfare and surveillance uses are the part that should genuinely concern you. The doom headlines about sentient AI are mostly a distraction from that.

That is the simple version of how Claude and ChatGPT work. Not magic. A stack of models, data, tools, and guardrails.

Behind one clean reply, the system may be predicting text, searching for context, calling tools, checking results, and managing cost and safety. Once you can see the pieces, the black box starts to look a lot more understandable.