Engineering onboarding often fails not because the documentation is bad but because documentation was never the whole job. The answer to "why does this service work this way" usually lives across Slack, PRs, ADRs, and tickets. Here's the architecture that retrieves it.
Writing more docs won't fix this
Most engineering onboarding programs are built around the assumption that the bottleneck is documentation. Write better docs, organize them better, link them better, and the new hire will find what they need. A lot of teams run this playbook and end up with the same pattern: a wiki that goes stale, a Notion that nobody can find, a Confluence that drifts out of sync with production within weeks of being written.
The bottleneck was never the writing. It's what happens when the new hire has a specific question - "why does this service have a 5% error budget" or "who owns the Kafka alert rules" - and the answer lives across a Slack thread from eight months ago, a closed PR, an ADR nobody can locate, and the head of someone who already left the company. The answer isn't a document. It's a connected trail of decisions, and documents are just one node type on that trail.
If you take that framing seriously, the architecture is not "better wiki." It's a connected knowledge graph, an identity source that knows what role the person is in, and an agent that traverses both on demand. That's the pattern this article describes.
The traditional answer to onboarding friction is a Confluence page that's six months out of date and a buddy who has their own work to do. Neither solves the problem. They distribute the friction across two failing channels and call it a process.
What new hires are actually hunting for
Most technical hires end up hunting for the same things. Which repos do I clone. Which clusters do I have access to. Why does the payments service have a 5% error budget instead of 1%. Who made that call. What thread did it happen in. Is the decision still valid. The useful context usually lives in Slack, in closed GitHub PRs, in ADRs nobody reads, and in the heads of people who might have already left the company.
The common developer experience instinct is to fix this by writing more documents. That helps at the margins, but it does not solve the actual problem, which is retrieval: retrieval that is connected, role-scoped, and traversable, not just searchable.
The architecture follows directly. An identity source that knows who the person is and what they're being hired to do - ServiceNow, Workday, SharePoint, whatever your org runs. A knowledge graph that connects tickets, PRs, threads, ADRs, services, and owners. A retrieval layer that filters by role at query time, not just by relevance. An agent on top that can stitch the pieces into something a human can act on quickly.
The four layers of Sentinel Day One
Each layer does one job. None of them are novel in isolation. The architecture is in how they connect.
Layer 1: Identity
Your employee catalog - ServiceNow, Workday, SharePoint, or any HRIS. Source of truth for who the person is, what team they belong to, what apps they're provisioned for, and what business functions they perform.
Layer 2: Graph
Neo4j. Connects Slack threads, GitHub PRs, Linear tickets, ADRs, Confluence pages, services, and owners into a navigable structure of decisions and dependencies.
Layer 3: Retrieval
Vector embeddings for semantic match. Graph traversal for context expansion. Role-scoped filtering at query time using the catalog identity as the filter key.
Layer 4: Agent
Claude Sonnet. Reads the employee catalog on day one to provision the environment. Stays available after that as the context layer for ongoing questions.
The identity layer: ServiceNow, Workday, SharePoint - it doesn't matter
The first instinct when building something like this is to make the agent observe what the user does and provision reactively. Watch them open an IDE, detect which repo they're working in, surface relevant context. That sounds clever and it is the wrong answer. Behavioral observation is noisy, it's slow, and it raises privacy questions you don't want to spend the next two quarters answering for an enterprise client in healthcare or insurance.
The better answer is already sitting in your employee catalog. We built this on ServiceNow because that's what the client ran, but the pattern is catalog-agnostic. Workday, SharePoint, BambooHR, Azure AD, any HRIS with an API - they all carry the same core data: who the person is, what team they belong to, what applications they're provisioned for, and what business functions they perform. That's a complete enough profile to provision an entire dev environment before the new hire opens their laptop. The catalog is declarative, authoritative, and immediate. It also happens to be the system that already triggers when a new hire is onboarded, so the agent sits downstream of an existing event flow rather than inventing a new one.
ServiceNow
Typical fit: ITSM-heavy orgs, regulated industries (healthcare, insurance, financial services). SNOW is often already the provisioning trigger for IT access requests.
What it exposes: Employee profile, team, role, provisioned apps, business functions, active service catalog requests.
Workday
Typical fit: HR-led orgs where headcount and org structure live in Workday. Common in larger enterprises across all industries.
What it exposes: Worker profile, org hierarchy, cost center, job profile, manager chain. Usually needs a middleware layer to translate into access topology.
SharePoint / Azure AD
Typical fit: Microsoft-stack orgs. Groups and roles in Azure AD often map directly to access permissions already enforced elsewhere in the stack.
What it exposes: Group memberships, assigned licenses, department, manager. Azure AD groups frequently mirror repo and tool access already, so this source needs the least translation.
BambooHR / other HRIS
Typical fit: Smaller or faster-moving orgs. Less mature provisioning automation, but the employee data model is the same.
What it exposes: Department, job title, start date, direct manager. Access topology usually requires a manual mapping layer on first setup.
If your identity source for onboarding is "ask the new hire what team they're on," your onboarding pipeline has the same brittleness as a Confluence page. The catalog already exists - in ServiceNow, Workday, SharePoint, or wherever your org keeps its headcount truth. Read from it. The agent doesn't care which one it is.
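One way to keep the agent catalog-agnostic is a thin adapter per identity source that maps each raw record onto a single profile shape. The sketch below is illustrative only: the `Profile` fields, the adapter names, and every payload key (`provisioned_apps`, `displayName`, the `team:` / `role:` group prefix) are assumptions, not the real ServiceNow or Azure AD schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Catalog-agnostic view of a new hire (hypothetical shape)."""
    person: str
    team: str
    role: str
    apps: tuple = ()


# Hypothetical payload keys; real catalog records differ per org and per product.
def from_servicenow(rec: dict) -> Profile:
    return Profile(rec["name"], rec["team"], rec["role"],
                   tuple(rec.get("provisioned_apps", [])))


def from_azure_ad(rec: dict) -> Profile:
    # Assumes groups are named "<kind>:<value>", e.g. "team:platform".
    groups = dict(g.split(":", 1) for g in rec["groups"] if ":" in g)
    return Profile(rec["displayName"],
                   groups.get("team", "unknown"),
                   groups.get("role", "unknown"),
                   tuple(rec.get("licenses", [])))


ADAPTERS = {"servicenow": from_servicenow, "azure_ad": from_azure_ad}


def load_profile(source: str, record: dict) -> Profile:
    """Everything downstream sees only Profile, never the raw catalog record."""
    return ADAPTERS[source](record)
```

With this shape, swapping the identity source means writing one more adapter; the graph, retrieval, and agent layers do not change.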
Why Neo4j and not just RAG
RAG alone doesn't solve this problem. Vector search can get you semantically close to an answer but it can't traverse a decision graph. When someone asks why a service has a 5% error budget, the answer isn't in any one document. It's in an ADR that references a Slack thread from eight months ago that was triggered by an incident postmortem that spawned three Linear tickets, two of which are still open. To return that as a useful answer, you need to walk the graph from any starting point and bring back the connected story, not just the closest match.
Neo4j is the right shape for this because the relationships are first-class. A PR closes a ticket. A ticket references an ADR. An ADR was decided in a Slack thread. A thread mentions a service. A service is owned by a team. A team has a current on-call. These are not properties on a document - they are edges between entities. You want to query "give me the full decision chain for this service" and have the graph return the chain, not a ranked list of documents that mention the service name.
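To make "edges, not properties" concrete, here is a minimal in-memory stand-in for the graph. It is not Neo4j and not the production schema: the relationship names and node identifiers are invented for illustration, and a real deployment would express the same walk as a graph query.

```python
from collections import defaultdict, deque

# (source, relationship, target) triples; every identifier here is made up.
EDGES = [
    ("pr:payments-retry", "CLOSES", "ticket:error-budget"),
    ("ticket:error-budget", "REFERENCES", "adr:error-budget"),
    ("adr:error-budget", "DECIDED_IN", "thread:budget-discussion"),
    ("thread:budget-discussion", "MENTIONS", "service:payments-api"),
    ("service:payments-api", "OWNED_BY", "team:payments"),
    ("team:payments", "ON_CALL", "person:current-on-call"),
]


def build_graph(edges):
    """Index outgoing edges by source node."""
    out = defaultdict(list)
    for src, rel, dst in edges:
        out[src].append((rel, dst))
    return out


def decision_chain(graph, start):
    """Breadth-first walk returning every (source, relationship, target) hop reachable from start."""
    seen, queue, chain = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for rel, dst in graph.get(node, []):
            chain.append((node, rel, dst))
            if dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return chain
```

Calling `decision_chain(build_graph(EDGES), "ticket:error-budget")` returns the five hops from the ticket through to the current on-call. This sketch follows outgoing edges only; walking "from any starting point" would also need the reverse direction.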
The two layers work together, not in competition. Vector embeddings get the agent to the right entry point in the graph. Graph traversal expands the answer from there. The retrieval layer assembles the result and the agent renders it into something a human can read in one screenful.
Lookup
Vector search alone: Returns the closest document. Works for "what is the retry policy."
Vector search plus graph traversal: Returns the document, the PR that changed it last, the owner, and whether it's currently being revised.
Decision
Vector search alone: Returns documents that mention the decision. Often misses the rationale and the trail behind it.
Vector search plus graph traversal: Returns the ADR, the Slack thread it came from, the incident that triggered it, and the open tickets still acting on it.
Ownership
Vector search alone: Returns mentions of names. Stale if the person left the company.
Vector search plus graph traversal: Returns the current owner from the team graph, the historical owner, and flags handoff gaps explicitly.
Impact
Vector search alone: Hard to express as a similarity query at all.
Vector search plus graph traversal: Native to graph traversal. "What depends on this service" is a one-hop query.
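The division of labor between the two layers can be sketched in a few lines: similarity picks the entry node, then traversal expands outward from it. The three-dimensional vectors, node names, and adjacency map below are toy data chosen for illustration; a real system would use an embedding model and the graph database.

```python
import math

# Toy "embeddings"; real vectors would come from an embedding model.
NODE_VECTORS = {
    "adr:error-budget": [0.9, 0.1, 0.0],
    "doc:retry-policy": [0.1, 0.9, 0.0],
    "doc:oncall-guide": [0.0, 0.2, 0.9],
}
# Hypothetical adjacency list standing in for the graph.
NEIGHBORS = {
    "adr:error-budget": ["thread:budget-discussion", "ticket:error-budget"],
    "thread:budget-discussion": ["incident:payments-outage"],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def entry_point(query_vec):
    """Semantic step: the node whose vector is closest to the query."""
    return max(NODE_VECTORS, key=lambda n: cosine(query_vec, NODE_VECTORS[n]))


def expand(node, hops):
    """Graph step: collect nodes within `hops` edges of the entry point, in discovery order."""
    found, frontier = [node], [node]
    for _ in range(hops):
        next_frontier = []
        for current in frontier:
            for neighbor in NEIGHBORS.get(current, []):
                if neighbor not in found:
                    found.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return found


def retrieve(query_vec, hops=2):
    return expand(entry_point(query_vec), hops)
```

With `hops=0` this degrades to plain nearest-neighbor lookup, which is the "vector search alone" behavior; each extra hop pulls in more of the trail.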
Role scoping is what stops the firehose
Because the employee catalog is the foundation of the system, every retrieval query is filtered by role from the very first call. But the more important point is what the catalog actually knows. It's not just "this person is an SRE." It encodes the full access topology: what they own, what they read, what they support but don't write to, what they need conditionally. The agent provisions from all of that, not just the primary bucket.
There's also a shared baseline every engineering hire gets regardless of role - Confluence, Jira, Slack, GitHub org read. The agent doesn't differentiate on those. What it does differentiate is the role-specific write access, the cross-team grants (SREs contribute to every app repo they support but own none of them, QA reads every app repo they test against), and the scoped credentials that carry real security implications if you get them wrong.
The same knowledge graph, queried through different role filters, produces completely different provisioning runs and completely different context surfaces. An SRE and a backend dev joining the same company on the same day should not have the same first morning.
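A minimal sketch of query-time role scoping, assuming each retrievable node carries a set of roles allowed to see it. The `Hit` type, the `visible_to` field, the scores, and the node names are all hypothetical; the point is only that the filter runs per query against one shared result set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    node: str
    score: float            # relevance from the semantic / graph step
    visible_to: frozenset   # roles allowed to see this node (assumed field)


def scoped_search(hits, role, limit=3):
    """Filter by role first, then rank by relevance."""
    allowed = [h for h in hits if role in h.visible_to]
    ranked = sorted(allowed, key=lambda h: h.score, reverse=True)
    return [h.node for h in ranked[:limit]]


HITS = [
    Hit("runbook:payments-failover", 0.92, frozenset({"sre"})),
    Hit("adr:error-budget", 0.88, frozenset({"sre", "backend"})),
    Hit("spec:payments-api-contract", 0.81, frozenset({"backend", "qa"})),
    Hit("suite:payments-e2e", 0.70, frozenset({"qa", "backend"})),
]
```

The same `HITS` list yields a runbook-first view for `"sre"` and an ADR-and-contract view for `"backend"`, without maintaining a separate index per role.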
Shared baseline - every engineering hire
Confluence, Jira / Linear, Slack workspace, GitHub org read, shared architecture overview docs, incident history read access, and the org's communication runbooks. The agent provisions these unconditionally. Role scoping starts at the layer above this.
App repos (e.g. payments-api)
SRE: Read/write. SREs contribute directly - OTel instrumentation, reliability patches, performance fixes found during incidents, health check endpoints. Embedded SRE stints mean full PR rights on some services.
Backend dev: Read/write. Primary workspace. Full clone, branch permissions, PR access.
QA: Read. Needed to understand what's being tested, trace test failures to source changes, and write accurate selectors.
Shared platform / reliability libs
SRE: Read/write. SREs often own or co-own shared observability SDKs, health check libraries, and circuit breaker wrappers that every service imports. This is core SRE output, not just infra work.
Backend dev: Read. Devs consume these libraries but don't own them.
QA: Read. Needed for understanding what instrumentation is available to assert against in tests.
Infra repos (e.g. infra-core, k8s-platform)
SRE: Read/write. Primary workspace. Terraform, Helm charts, cluster config, alert rules.
Backend dev: Read. Devs need enough infra context to understand how their service is deployed and what's configured around it.
QA: Read. Needed to understand environment topology, what staging looks like vs. prod, and where load test targets actually land.
Test repos (e.g. e2e-tests, load-tests)
SRE: Read/write. SREs own chaos engineering suites, synthetic availability tests, and load test scenarios for SLO validation. These live in test repos and are SRE-authored code.
Backend dev: Read. Devs should know what e2e coverage exists before landing a change that breaks it.
QA: Read/write. Primary workspace. Full test authoring, fixture ownership, CI pipeline config.
Kubernetes cluster access
SRE: Full - prod, staging, dev. SRE owns cluster operations across all environments.
Backend dev: Dev and staging namespaces only. Scoped to their service's namespace. No prod exec or cluster-wide view.
QA: Staging and dev only. Read-only. Enough to inspect pod state during test failures.
Vault / secrets
SRE: sre/platform scope. Cluster certs, infra credentials, alert integration tokens.
Backend dev: app/service scope. DB credentials, API keys, service-to-service auth for their service only.
QA: staging-env scope only. Test data credentials and staging integration tokens. No prod secrets.
Observability platform
SRE: Full access. Dashboard authoring, alert management, DQL / SPL query access across all services.
Backend dev: Read. Service-scoped dashboards and traces for their app. No cross-service alert ownership.
QA: Read. Enough to correlate test run failures with trace data and surface timing anomalies during load tests.
CI/CD (ArgoCD, GitHub Actions)
SRE: Full platform access. Owns the deployment pipeline infrastructure, rollback capability, and environment promotion gates.
Backend dev: Write access scoped to their service's pipeline. Can trigger, rollback, and configure their own deploy workflow.
QA: Write access to test pipelines only. Can trigger test runs, modify test workflow config, gate on coverage thresholds.
Alert rules and on-call config
SRE: Full. Owns alert authoring, threshold configuration, escalation policy, and PagerDuty schedule management.
Backend dev: Read. Devs should know what alerts fire on their service but do not own the alert config.
QA: None by default. QA can request read access for specific services if their test scope requires it.
Test management (TestRail, etc.)
SRE: None by default.
Backend dev: Read. Devs should be able to see what test coverage exists against their service.
QA: Full. Test plan authoring, run management, coverage reporting, and regression tracking.
The agent reads the full access topology from the employee catalog and provisions all of this in a single pass. The cross-team access grants are the part most manual onboarding processes get wrong. Modern SREs write code across app repos, own shared reliability libraries, and author chaos suites. Treating them as infra-only misprovisions the laptop before the person has even opened it. The catalog knows the reality of the role. The agent just has to read it correctly.
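The access topology above is naturally a lookup keyed by resource and role, layered on top of the shared baseline. The sketch below encodes a subset of it; the dictionary layout, key names, and `provisioning_plan` helper are assumptions made for illustration, not the system's actual data model.

```python
# Shared baseline every engineering hire gets, per the article.
BASELINE = ["confluence", "jira", "slack", "github-org-read"]

# Subset of the access topology, keyed by resource then role.
# None means "no access by default".
ACCESS = {
    "app-repos":       {"sre": "read/write", "backend": "read/write", "qa": "read"},
    "infra-repos":     {"sre": "read/write", "backend": "read",       "qa": "read"},
    "test-repos":      {"sre": "read/write", "backend": "read",       "qa": "read/write"},
    "alert-rules":     {"sre": "full",       "backend": "read",       "qa": None},
    "test-management": {"sre": None,         "backend": "read",       "qa": "full"},
}


def provisioning_plan(role):
    """Return {resource: access level} for one role: baseline first, then role-specific grants."""
    plan = {name: "baseline" for name in BASELINE}
    for resource, grants in ACCESS.items():
        level = grants[role]  # raises KeyError for an unknown role instead of guessing
        if level is not None:
            plan[resource] = level
    return plan
```

Keeping "none by default" explicit as `None`, rather than omitting the entry, makes a missing grant distinguishable from a forgotten one when the matrix is reviewed.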
Live demo - three personas, one agent
The demo below runs three new hires through their first morning. Pick a persona - SRE, backend dev, or QA - and watch the agent read the employee catalog entry, provision the environment, and surface the role-scoped backlog. Use the pre-seeded quick prompts or type your own question. The graph traversal is simulated; for freeform questions the backend is a live Claude Sonnet call constrained to the persona's role context.
How to use this
Switch personas using the tabs at the top. Each boot sequence is different - the identity source, the repos cloned, the vault scope, and the backlog items all change. Then use the quick prompt chips or type a question in the terminal input. Pre-seeded questions hit the simulated knowledge graph directly; anything else routes to Claude with a role-scoped system prompt.
What this replaces in the traditional onboarding flow
The honest comparison is not "Sentinel Day One vs nothing." It's Sentinel Day One vs the current state of engineering onboarding at most companies, which is some mix of a Confluence page, a buddy program, a checklist in Notion, and a ticket workflow in ServiceNow. That workflow ends when access is provisioned, which is exactly when the real onboarding starts: the new hire opens Slack to ask their first question.
Day 0 prep
Traditional: HR ticket triggers IT provisioning. Buddy is assigned in a calendar invite. New hire receives a welcome packet.
Sentinel Day One: Same triggers, but the agent reads the employee catalog entry (ServiceNow, Workday, SharePoint - whatever the org runs) and pre-stages the environment. Dev tooling, repos, cluster access, and backlog are ready before the laptop boots.
Day 1 morning
Traditional: New hire logs in, can't find half the docs, asks buddy. Buddy is in meetings. Hire reads outdated Confluence.
Sentinel Day One: Agent walks the new hire through what's been provisioned. Surfaces the three to five backlog items relevant to their role and the architecture context they need to understand them.
First context question
Traditional: "Why does the payments service have a 5% error budget?" Asks in Slack. 30-minute back and forth across three threads. Half-answered.
Sentinel Day One: Same question to the agent. Returns the ADR, the Slack thread it came from, the outage that triggered it, the current owner, and the open ticket still acting on the decision.
First week
Traditional: A large fraction of the new hire's time is scavenger hunting. Buddy is interrupted constantly. Productive contribution starts in week three.
Sentinel Day One: Scavenger hunting is replaced by direct queries. Buddy is freed for actual mentorship. Time to first meaningful contribution drops materially, with the exact gain dependent on codebase complexity and role.
The thing this replaces is not the documentation. The documentation can stay. What this replaces is the human friction of finding, interpreting, and connecting the documentation - which is the part nobody scales by writing more docs.
What this doesn't do
Sentinel Day One is not a code generation tool, it does not pair-program, and it does not write the new hire's first PR. The role it plays is context retrieval and provisioning, not authoring. Conflating the two is how you get an onboarding tool that produces confident wrong answers about the codebase architecture instead of correctly pointing to the human decision trail.
The buddy doesn't get replaced either. The buddy gets freed up. Answering the same five questions every new hire asks is the wrong use of a senior engineer's time. Having a real conversation about the team's roadmap and where this person fits is the right use. Sentinel Day One moves the rote questions off the buddy so the buddy can do the part only a human can.
The tool also doesn't pretend to have answers it doesn't have. When the graph doesn't contain the context to answer a question well, the agent says so and surfaces the closest related entities instead. Confident hallucination is not acceptable in an onboarding tool that the new hire will treat as authoritative. The system prompt enforces "I don't have that context, here's the closest related entity" as a first-class response.
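The article enforces this behavior through the system prompt. A complementary guard can sit in the retrieval layer, so the model is never handed weak context to begin with. The sketch below shows that retrieval-side variant; the `Answer` type, the `MIN_SCORE` threshold, and the placeholder answer text are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Answer:
    text: str
    sources: list = field(default_factory=list)
    grounded: bool = True


MIN_SCORE = 0.75  # assumed cutoff; would need tuning against real queries


def answer_or_decline(hits):
    """hits: list of (node_id, score). Decline rather than guess when nothing clears the bar."""
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    strong = [node for node, score in ranked if score >= MIN_SCORE]
    if strong:
        # Placeholder text; a real system would hand `strong` to the model for synthesis.
        return Answer(text="Summarized from the cited sources.", sources=strong)
    closest = [node for node, _ in ranked[:2]]
    return Answer(
        text="I don't have that context, here's the closest related entity.",
        sources=closest,
        grounded=False,
    )
```

The `grounded` flag lets the UI render a declined answer differently from a sourced one, so the new hire can tell at a glance which kind they got.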
Where AI actually earns its keep here
There's a real role for AI here and it isn't "AI buddy" or "AI mentor," both of which misread what makes onboarding hard. The teams getting real value from AI in onboarding are using it for retrieval and synthesis, not relationship substitution.
Walking the graph, not searching for keywords
The LLM is good at walking a knowledge graph when given a starting node and an intent. "Find me the decision history for this service" becomes a structured traversal query that the agent assembles into a readable answer. The graph does the work of knowing what's connected to what. The agent does the work of explaining it in context.
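One plausible way to turn "a starting node and an intent" into a structured traversal is to map each intent onto the relationship types worth following. This is a sketch under that assumption; the intent names, relationship labels, and node identifiers are invented, and the article does not specify how the agent builds its queries.

```python
from collections import deque

# Hypothetical mapping from a user intent to the edge types the walk may follow.
INTENT_RELATIONS = {
    "decision_history": {"GOVERNED_BY", "DECIDED_IN", "TRIGGERED_BY"},
    "ownership": {"OWNED_BY", "ON_CALL"},
}

# Made-up graph fragment: node -> [(relationship, target)].
EDGES = {
    "service:payments-api": [("GOVERNED_BY", "adr:error-budget"),
                             ("OWNED_BY", "team:payments")],
    "adr:error-budget": [("DECIDED_IN", "thread:budget-discussion"),
                         ("TRIGGERED_BY", "incident:payments-outage")],
    "team:payments": [("ON_CALL", "person:current-on-call")],
}


def walk(start, intent):
    """Breadth-first walk from `start`, following only edges relevant to the intent."""
    allowed = INTENT_RELATIONS[intent]
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for rel, dst in EDGES.get(node, []):
            if rel in allowed and dst not in seen:
                seen.add(dst)
                order.append(dst)
                queue.append(dst)
    return order
```

Starting from the same service node, the `"decision_history"` intent returns the ADR, thread, and incident, while `"ownership"` returns the team and on-call. The model's job is then to explain the returned nodes, not to decide what is connected.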
Same context, different framing per role
An SRE asking about the payment service wants to know the SLO, the incident history, and the runbooks. A backend dev wants the API contracts, the retry policy, and the open spec questions. Same knowledge graph, different system prompt. The LLM handles the reframing cleanly without duplicating the underlying data.
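"Same knowledge graph, different system prompt" can be as simple as a per-role focus list injected into an otherwise identical template. The prompt wording, the `ROLE_FOCUS` table, and the function name below are illustrative assumptions, not the prompts the system actually ships.

```python
# Focus areas per role, taken from the examples in the paragraph above.
ROLE_FOCUS = {
    "sre": ["SLO", "incident history", "runbooks"],
    "backend": ["API contracts", "retry policy", "open spec questions"],
}


def build_system_prompt(role, context_nodes):
    """Same retrieved context for every role; only the framing instructions change."""
    focus = ", ".join(ROLE_FOCUS[role])
    context = "\n".join(f"- {node}" for node in context_nodes)
    return (
        f"You are an onboarding context assistant for a new hire in the {role} role.\n"
        f"Prioritize: {focus}.\n"
        "Answer only from the context below and cite the entities you used.\n"
        f"Context:\n{context}"
    )
```

Because the context block is identical across roles, nothing in the underlying data is duplicated; only the instruction lines differ.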
Every answer shows its sources
For every answer the agent gives, it cites the source entities - the specific Slack thread, the specific PR, the specific ADR. This isn't a footnote, it's the answer. The new hire's next click is the source, not a follow-up question. Without this, the agent is a black box. With it, it's a trusted index over the company's actual decision history.
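One way to make provenance non-optional is to refuse to render any answer that arrives without sources. The `Source` type, its fields, and the output layout here are hypothetical; the sketch only illustrates treating citations as part of the answer rather than an attachment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    kind: str  # e.g. "slack_thread", "pr", "adr"
    ref: str   # hypothetical stable identifier or link for the entity


def render(answer_text, sources):
    """Render an answer with its sources; refuse if there is nothing to cite."""
    if not sources:
        raise ValueError("answer has no provenance; decline instead of rendering")
    lines = [answer_text, "", "Sources:"]
    lines += [f"  [{i}] {s.kind}: {s.ref}" for i, s in enumerate(sources, start=1)]
    return "\n".join(lines)
```

Raising on an empty source list forces the caller to take the "I don't have that context" path explicitly instead of shipping an uncited answer.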
Provisioning as a multi-step agent task
Reading the employee catalog and turning it into provisioning calls across GitHub, Vault, kubeconfig, and CI/CD is a multi-step workflow with retries, error handling, and verification at each step. An agent runs this end to end when the catalog event fires and reports what landed, what failed, and what needs a human.
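The retry, verify, and report loop can be sketched independently of any real provisioning API. In the example below each step is a pair of plain callables supplied by the caller; the `Report` type and `run_steps` name are assumptions, and a production version would add backoff between attempts and catch narrower exceptions.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    landed: list = field(default_factory=list)
    failed: list = field(default_factory=list)
    needs_human: list = field(default_factory=list)


def run_steps(steps, max_attempts=3):
    """steps: list of (name, apply_fn, verify_fn). apply_fn may raise; verify_fn returns bool."""
    report = Report()
    for name, apply_fn, verify_fn in steps:
        last_error = None
        for _ in range(max_attempts):
            try:
                apply_fn()
                last_error = None
                break
            except Exception as exc:  # sketch only; real code should catch narrower errors
                last_error = exc
        if last_error is not None:
            report.failed.append((name, str(last_error)))
        elif verify_fn():
            report.landed.append(name)
        else:
            # Applied without error but verification did not confirm it.
            report.needs_human.append(name)
    return report
```

Separating "failed" from "needs a human" matters: a step that raised three times is a known failure, while a step that succeeded but did not verify is the ambiguous case someone should look at.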
AI in onboarding earns its keep as a fast path through an existing knowledge graph, not as a replacement for the knowledge graph itself. The teams getting value from this are using it to make context that already exists findable and connected. The job is not to invent answers. The job is to stop making new hires spend their first week hunting for them.
What it actually takes
Your employee catalog for identity. Neo4j for the connected graph. Embeddings for semantic entry points. An agent on top. Where teams get this wrong is by inverting the order - building the agent first, then trying to retrofit the knowledge graph, then realizing they have no clean identity source and trying to invent one. The order matters.
The hard work is not the model. The hard work is the graph - assembling Slack, GitHub, Linear, ADRs, Confluence, and your service catalog into a single connected structure where the relationships are explicit and current. Once that exists, the agent is just a query interface. The model is interchangeable. The graph is not.
01
Treat onboarding as a retrieval problem, not a documentation problem
The information already exists, scattered across Slack, PRs, tickets, and ADRs. Writing more documents doesn't make the information findable. Making the information traversable does.
02
Use the identity source you already have
ServiceNow, Workday, SharePoint, BambooHR - some catalog already knows who this person is, what team they're on, and what they're being hired to do. Read from it. Don't ask the new hire to tell the agent what team they're on.
03
Build the graph before the agent
An agent over no graph is a chatbot that confidently hallucinates. A graph with no agent is a database nobody queries. The graph is the moat, the agent is the surface. Build them in that order.
04
Scope retrieval by role at query time, not by index
One graph, many roles. Scoping at query time means an SRE and a backend dev see different views of the same connected truth. Indexing per role multiplies your storage cost and creates parallel realities you'll have to reconcile later.
05
Surface provenance in every answer
The Slack thread, the PR number, the ADR, the owner. Every agent response should let the new hire click through to the source. Without provenance, the agent is a black box. With it, the agent becomes a trusted index over the company's actual decision history.
06
Provision dynamically from the identity source, not from a static playbook
Playbooks assume you know what every new hire needs. Identity sources tell you. Provisioning driven by the employee catalog adapts to role, team, and function changes without anyone updating a checklist.
07
Constrain the model to retrieval and synthesis
The agent is not a pair programmer, not a mentor, not a code reviewer. The agent is a context layer. Keeping the scope tight is what keeps the answers honest. The buddy and the senior engineer still exist - they just have less scavenger hunting to do.
08
Measure the right outcome
The metric isn't "questions asked of the agent" or "documents indexed." The metric is "time to first PR" and "buddy interruption rate." The first measures the new hire's velocity. The second measures whether the friction is actually gone or just relocated.
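Both metrics reduce to simple arithmetic once the events are logged. The sketch below assumes "time to first PR" is measured in calendar days from start date to the first PR being opened, and that buddy interruptions are counted as questions per working day; those definitions and the sample dates are illustrative, not measured results.

```python
from datetime import date
from statistics import median


def days_to_first_pr(start, first_pr_opened):
    """Calendar days between the hire's start date and their first PR."""
    return (first_pr_opened - start).days


def buddy_interruption_rate(questions_to_buddy, working_days):
    """Average questions routed to the buddy per working day."""
    return questions_to_buddy / working_days


# Made-up cohort of (start date, first PR date) pairs, purely for illustration.
cohort = [
    (date(2024, 3, 4), date(2024, 3, 15)),
    (date(2024, 3, 4), date(2024, 3, 8)),
    (date(2024, 3, 11), date(2024, 3, 25)),
]
median_days = median(days_to_first_pr(start, pr) for start, pr in cohort)
```

Tracking the median rather than the mean keeps one unusually slow or fast hire from dominating the number. Comparing both metrics before and after rollout shows whether friction was removed or merely moved.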
The goal isn't a smarter onboarding chatbot. The goal is a new hire who, within their first week, can answer "why does this service work this way and what's the open work on it" without scheduling a meeting, opening Confluence, or paging a senior engineer. Build toward that outcome directly. Everything else is decoration.