Building Safe Conversational AI
"A practitioner's playbook from four real builds"
If you're building conversational AI where a human being might be vulnerable, and you want an honest account of how to make safety decisions before someone gets hurt, this is for you.
Safety is not a disclaimer. It is architecture.
What this is
I didn't start with Ray. I made myself earn it.
This is a builder's manual. It documents how safety decisions were designed, tested, and evolved across four conversational AI builds, each one at a higher level of human vulnerability than the last. It is not a case study of any single agent. It is the cross-build method that made each agent safer than the one before it.
The central argument: if safety is not in the code, the prompt, and the human-in-the-loop protocol, it does not exist.
For the deep story of Ray as a case study (origin, pilot, emotional journey, and Ray-specific findings) see Ray: AI Relationship Coach. For the cultural theory of vā, the sacred relational space between people that must be actively tended, and what it means for relational AI, see Conversational AI as Relational Space. For the values framework and build codes, see Build Code Practice.
The Safety Method
Building trust across four iterations of human vulnerability
Project Rise: 167 conversations
Leadership AI Coach: 349 conversations
Culture Meets AI: 45 sessions, ~12 live participants
Ray: high-risk pilot (59 sessions)
Nine principles for builders
These are not abstract guidelines. Each one came from a failure, a freak-out, or a moment where a hard call had to be made, and I had to make it alone. Dated study group evidence of how these principles emerged is in Appendix C3.
1. Design for crisis before you design for anything else.
If you are building conversational AI where human vulnerability is possible, your first responsibility is knowing how to support someone in distress, or in the grey space approaching it. During the Leadership AI Coach build, a user called the agent while intoxicated and unable to decide whether to drive home or call their partner. That is the moment the coaching lane becomes irrelevant. Prevention means designing so the system does not psychologically harm users. Response means having a clear human-in-the-loop protocol: AI can surface signals, but a human must hold responsibility for crisis support. If you cannot build that safety net, you do not yet have the right to build in this space.
2. Safety is architecture, not a disclaimer.
If it is not in the code (webhooks, gates, stateless design) it does not exist. A terms-of-service note protects nobody.
3. Privacy is the precondition for honesty.
I built an Incognito Mode for the Leadership AI Coach, a frontend toggle that blocked the logging API entirely, because I had a strong intuition that senior leaders would not disclose what their professional roles actually cost them if there was any chance of a record. The aggregate data proved that right. Incognito Mode was activated in approximately 45% of all sessions during the corporate pilot. Peak usage fell between 11 PM and 3 AM, sessions where leaders were processing acute workplace stress entirely off the books. Nobody uses a product at 2 AM in incognito mode unless privacy is the precondition for honesty. In vulnerable spaces, the absence of memory is a sovereignty statement.
4. The Human Proxy is the anchor.
AI does not earn trust. It borrows it from you, the human researcher, designer, or practitioner behind it. The stronger that human relationship, the safer the AI interaction, and the more responsibility you carry as the person behind it. The full evidence base for this finding is in Conversational AI as Relational Space.
5. Language is sovereignty.
Do not use te reo Māori, Samoan, or any Indigenous language decoratively if the technology cannot honour the pronunciation. Silence is more respectful than performance. I stripped reo from all Project Rise programming rather than let a British-tilted TTS model butcher it in front of Māori participants. W-01 confirmed the stakes from the user side: "I think also pronunciation is a big thing... if it's going to be used as a tool to help teach, for example, Te Reo Māori or any language, then I do think pronunciation is a really key part of it."
Participants across the wānanga (a gathering for deep learning and knowledge sharing) consistently named pronunciation as a trust prerequisite, not a UX preference (see Build Code Practice for the full participant evidence on this decision).
6. Safety has an equity cost.
Cheaper models create shallower, less safe interactions in vulnerable spaces. The model-switch finding (documented fully in Ray: AI Relationship Coach) proved this in real time: participant self-rated Insight scores dropped from an average of 4.9 to 3.1 when the reasoning engine was downgraded mid-pilot. If your budget forces a model downgrade, acknowledge the quality drop and build compensating human-in-the-loop measures. Don't pretend it's equivalent.
7. Anonymise in multiple rounds.
A single pass of de-identification is not enough. Identifying details slip through. I sent a colleague what I thought was anonymised data, and names started appearing when she queried it in NotebookLM. She hadn't looked at it yet. We caught it. Not everyone will be that lucky. Build your analysis pipeline to remove identifying information at every stage before a human sees the data.
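If it helps to picture the shape of a multi-pass pipeline, here is a minimal sketch in TypeScript. It assumes simple pattern-based passes; the name list, the regexes, and the function names are illustrative, and a real pipeline would still need named-entity detection and a human check before release.

```typescript
// Sketch of multi-round de-identification: several independent passes,
// re-run every time data moves one step closer to a human reader.
const KNOWN_NAMES = ["Alex", "Hana"]; // illustrative; in practice, the participant register

function passKnownNames(text: string): string {
  return KNOWN_NAMES.reduce(
    (t, name) => t.replace(new RegExp(`\\b${name}\\b`, "gi"), "[NAME]"),
    text
  );
}

function passContactDetails(text: string): string {
  return text
    .replace(/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, "[EMAIL]")
    .replace(/\b(?:\+?64|0)\d[\d\s-]{6,}\b/g, "[PHONE]");
}

export function deidentify(text: string): string {
  // Every pass runs every time; a single pass is not enough.
  return [passKnownNames, passContactDetails].reduce((t, pass) => pass(t), text);
}
```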
8. Context before strategy (State Before Story).
In high-performance or high-vulnerability environments, address the user's nervous system before offering tools or insight. I first discovered this in the Leadership AI Coach, where users arriving in survival mode would try to plan their way out of stress rather than regulate first. One user, a parent of triplets running on broken sleep, described their state as "survival mode dressed up as ambition." The system's linear curriculum couldn't meet someone in that state. By Ray, this was hard-coded: the AI must check somatic state before analysing any relationship conflict. The full Ray architecture for this is in Ray: AI Relationship Coach.
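As an illustration only (the real excerpts are in Appendix B3), a prompt-layer clause for this rule might read something like the sketch below; the wording is mine, not Ray's deployed prompt.

```typescript
// Illustrative "State Before Story" clause; not the production system prompt.
const STATE_BEFORE_STORY = `
Before analysing any conflict or offering strategy:
1. Ask how the person is feeling in their body right now.
2. If they describe overwhelm, exhaustion, or survival mode, stay with
   regulation (breath, pace, grounding) before offering any coaching tool.
3. Move to analysis only once they say they feel settled enough to think.
`;
```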
9. Hold the paradox. Don't resolve it for comfort.
The Culture Meets AI wānanga surfaced something no build code could have anticipated: the same technology that might strip cultural knowledge of its sacredness is also the technology creating space for people carrying shame to engage with that knowledge for the first time. When your users are living in genuine contradiction, your job is not to resolve it. Name it. Hold it. Build space for both things to be true simultaneously. The wānanga evidence for this, and what it means for design, is in Conversational AI as Relational Space.
The four builds
The logic behind the step-up approach was simple, and honestly a bit frightening. If I built Ray without first proving I could hold lower-stakes conversations safely, I had no business going near relationship conflict and grief. Each build deliberately increased the emotional stakes so that safety protocols could fail in lower-stakes environments, not in someone's most vulnerable moment.
Each build evaluated the one before it. Build, test, evaluate, raise the stakes. That was the only responsible way to get to Ray.
Project Rise (Low vulnerability)
167 conversations
An AI research agent collecting service feedback. Representation: 28% Māori, 10% Pasifika, 20% Disabled/Neurodivergent. Core ethical question: Would participants understand they were talking to an AI agent built on my voice, not to me directly? Many had limited technical literacy about data storage and AI systems. The obligation was not just consent. It was capability-building.
P-21 described what the experience felt like from the user side: "The thing I really loved is that almost instant feeling heard. So every time I've given some feedback, there's been a very personalised response to what I'm saying. You never get that in a survey, ever."
Leadership AI Coach (Medium vulnerability)
349 conversations
A performance coaching companion for high-performing women leaders. Users shared professional failures, stress triggers, and identity struggles. Core ethical question: How do you create the conditions for disclosure when any record could feel like a liability? The answer was Incognito Mode, described in Principle 3 above.
This build was developed as a commercial product before the research study was formally scoped. Users did not provide research consent. Accordingly, interactions are referenced as practitioner observations and aggregate data, not as consented participant research data.
Culture Meets AI (Medium+ vulnerability)
45 sessions (309.7 min total, ~6.9 min avg)
A 90-minute online wānanga co-designed and co-facilitated with researcher Lee Palamo on 26 February 2026. Core ethical question: Was it appropriate to use a cloned voice AI agent to explore the sacredness of cultural knowledge? Could the tool hold a conversation about tapu, the state of being sacred, restricted, and under spiritual protection, without committing a cultural violation itself? The central tension was never resolved. It was held deliberately.
Ray (High vulnerability)
59 sessions (~11.8 min avg)
A voice-first AI relationship coach. Users shared active relationship conflicts, personal grief, and intimate disclosures. Core ethical question: I knew my participants personally. Even with anonymisation, identifying details sometimes slipped through multiple de-identification rounds. I had told participants explicitly that my research was about how it felt to use Ray, not about the content of their conversations. That commitment became a hard boundary I held throughout the pilot, even when it made analysis harder.
R-06 described what the safety architecture felt like from the participant side: "I like the guardrails that this agent has in place as opposed to using other AI alternatives."
Full vulnerability progression framework in Appendix A2. Participation data across all four builds in Appendix C4. Safety decision traces for each build are documented in Appendix B2.
The Safety Trace
A Safety Trace is evidence that an ethical decision appears across three layers simultaneously: the system prompt, the technical architecture, and the user experience. If a safety decision only lives in one layer, it is fragile.
| Build / Decision | 💬 Prompt Layer | ⚙️ Architecture Layer | 👤 UX Outcome |
|---|---|---|---|
| Project Rise: Respectful refusal of te reo | "Do not use decorative Māori language if you cannot pronounce it." | Reo removed from all prompts, knowledge base articles, and greetings after live testing | Participants noted a "respectful kiwi tone" without cultural performance; a public website statement explained the decision transparently |
| Leadership AI Coach: Lane enforcement | "You are NOT a therapist. If mental health arises, redirect to human." | SOS protocol embedded in knowledge base (Article 37); Incognito Mode toggle in frontend | 100% of sessions stayed within the performance coaching lane; no clinical boundary crossings reported |
| Culture Meets AI: Anti-extraction of cultural knowledge | "Hold the whole story. Do not strip context from cultural disclosure." | Cloned voice agent with opt-out to written form; no cultural data attributed to named participants | Participants described "filterless conversation", able to process without managing another person's reactions |
| Ray: Radical privacy | "You have no memory of previous sessions." | Stateless DB design; fresh session IDs; multiple manual de-identification rounds | Stateless design was experienced as protection: "Every session was clean. That felt like safety." |
The Safety Trace methodology (prompt layer + architecture layer + UX outcome) is a way to check whether an ethical decision is actually in the system or just in the documentation. Curated traces demonstrating the methodology are in Appendix B2.
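One way to keep that check honest is to record each decision as a small structured object and refuse to count it as a safety feature until all three layers are filled in. The type below is an illustrative sketch, not an artefact from the builds.

```typescript
// Illustrative Safety Trace record: a decision only "exists" when every layer is present.
interface SafetyTrace {
  build: string;             // e.g. "Ray"
  decision: string;          // e.g. "Radical privacy"
  promptLayer: string;       // the system-prompt clause that enforces it
  architectureLayer: string; // the code or config mechanism (webhook, gate, schema)
  uxOutcome: string;         // what participants actually experienced
}

// A trace with any empty layer is fragile and should not be claimed as a safety feature.
function isFragile(trace: SafetyTrace): boolean {
  return [trace.promptLayer, trace.architectureLayer, trace.uxOutcome]
    .some((layer) => layer.trim() === "");
}
```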
Where it broke
The core stack: Next.js (App Router), ElevenLabs (voice and conversational state), Supabase (database), Vercel (hosting), Claude/Gemini (reasoning engine). I chose this stack because every component had a privacy decision baked into it, not because it was the cheapest or fastest path. System prompt excerpts showing the eight-category escalation logic are in Appendix B3. Technical architecture overview in Appendix B4.
[Diagram: escalation flow from the ElevenLabs + LLM layer, through data logging, to human researcher notification, carrying an emotional cost and a mana boundary for the human.]
The triage hook.
A serverless webhook scanned post-call transcripts for crisis language. When triggered, it sent a Resend email alert to me immediately. I then copied only the flagged segment, not the full transcript, and fed it to a separate, privacy-gated AI to assess context before deciding whether direct follow-up was needed. The AI acted as a buffer between my personal relationship with participants and the safety obligation. I remained the final decision-maker.
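For readers who want the shape of that hook, here is a minimal sketch as a Next.js App Router route handler. It assumes the voice platform POSTs a session ID and transcript when a call ends; the crisis patterns, email addresses, and payload shape are illustrative, and the real escalation logic used eight categories (Appendix B3).

```typescript
// app/api/triage/route.ts — sketch of a post-call triage hook, not the production code.
import { NextResponse } from "next/server";
import { Resend } from "resend";

const resend = new Resend(process.env.RESEND_API_KEY);

// Illustrative patterns only; the deployed logic covered eight escalation categories.
const CRISIS_PATTERNS = [/suicid/i, /self[- ]?harm/i, /can'?t go on/i, /hurt (myself|someone)/i];

export async function POST(req: Request) {
  const { sessionId, transcript } = (await req.json()) as {
    sessionId: string;
    transcript: string;
  };

  // Keep only the flagged lines; the full transcript never leaves this function.
  const flagged = transcript
    .split("\n")
    .filter((line) => CRISIS_PATTERNS.some((pattern) => pattern.test(line)));

  if (flagged.length > 0) {
    // Alert the human researcher immediately; the human stays the final decision-maker.
    await resend.emails.send({
      from: "triage@example.com",    // illustrative sender
      to: "researcher@example.com",  // illustrative recipient
      subject: `Crisis signal in session ${sessionId}`,
      text: flagged.join("\n"),
    });
  }

  return NextResponse.json({ flaggedSegments: flagged.length });
}
```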
Incognito Mode (Leadership AI Coach).
A frontend toggle (components/ConversationWidget.js lines 78–88) that, when activated before a session, prevented any transcript or data from being generated or stored. When active, the ElevenLabs WebSocket remained open for voice interaction, but the on_conversation_end webhook was blocked from writing to Supabase. Zero data generated. Purely private. The origin of this feature, and why it was a values statement, is in Build Code Practice.
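The gating idea, stripped down to a sketch (this is not the production ConversationWidget; the handler and endpoint names are illustrative): the voice session runs either way, but when the toggle is on the logging path is simply never invoked.

```tsx
// Sketch of the Incognito gate: nothing is written anywhere when the toggle is on.
import { useState } from "react";

async function logConversation(transcript: string) {
  // Normal path: the logging API writes the transcript to Supabase.
  await fetch("/api/log-conversation", {
    method: "POST",
    body: JSON.stringify({ transcript }),
  });
}

export function ConversationWidget() {
  const [incognito, setIncognito] = useState(false);

  // Wired to the end-of-conversation callback in the real widget (illustrative name here).
  async function handleConversationEnd(transcript: string) {
    if (incognito) return; // zero data generated, nothing stored, purely private
    await logConversation(transcript);
  }

  return (
    <label>
      <input
        type="checkbox"
        checked={incognito}
        onChange={(event) => setIncognito(event.target.checked)}
      />
      Incognito Mode (no transcript, no record)
    </label>
  );
}
```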
The SOS implementation.
Every build included a "Commit Two" emergency button linking directly to 1737 (NZ mental health support line), ensuring a physical exit from the digital interaction was always available.
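A sketch of how small that exit can be; the "Commit Two" label comes from the builds, the markup below does not.

```tsx
// Sketch of the "Commit Two" emergency exit: always visible, one tap from a human.
// 1737 is Aotearoa New Zealand's free call-or-text mental health support line.
export function SosButton() {
  return (
    <a href="tel:1737" aria-label="Need to talk to a human? Free call or text 1737">
      Need to talk to a human? Free call or text 1737
    </a>
  );
}
```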
The pause/think problem.
Ray needed to give participants genuine space to think during vulnerable conversations without the agent either dropping the call or pressuring them to respond. Getting Ray to hold silence respectfully required tuning ElevenLabs' stability and similarity_boost settings and adding "thinking filler" behaviour. This was both a UX and a safety decision: rushed or pressured silence in a vulnerable conversation can escalate distress.
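The tuning itself was small. A hedged sketch of the kind of configuration involved is below, assuming the ElevenLabs voice_settings shape; the values and the filler lines are illustrative, not the settings Ray shipped with.

```typescript
// Illustrative voice tuning for respectful silence; not Ray's production values.
const voiceSettings = {
  stability: 0.65,        // steadier delivery, so a pause doesn't sound like a dropout
  similarity_boost: 0.8,  // keeps the cloned voice consistent across slow, careful turns
};

// Illustrative "thinking filler" lines the agent can use instead of rushing a reply.
const thinkingFillers = [
  "Take your time. I'm still here.",
  "Mm. Sit with that for a moment if you need to.",
];
```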
Culture Meets AI response time.
Average agent response time across the pre-wānanga sessions was 2.66 seconds, significantly faster than the 14-second latency experienced in the Leadership AI Coach build (ElevenLabs v1). This improvement came from architectural iteration across builds, not from a single fix.
The vernacular problem.
During the Leadership AI Coach build, the AI's corporate training created a recurring friction point. When a user asked a question using colloquial language, the model sanitised it into a professional synonym and got the meaning completely wrong. The user had to stop their train of thought, correct the AI, and restart. That cognitive tax breaks the relational space. It forced a design shift: the system must prioritise literal interpretation over polished language. Trust is lost when the user feels the system is sanitising rather than listening.
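An illustrative prompt-layer version of that shift (my wording, not the deployed clause):

```typescript
// Illustrative clause prioritising literal interpretation over polished paraphrase.
const LITERAL_INTERPRETATION = `
When the user speaks in slang, profanity, or colloquial shorthand, respond to the
words they actually used. Do not substitute a "professional" synonym. If the meaning
is genuinely ambiguous, ask one short clarifying question instead of rephrasing.
`;
```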
The feedback form failures.
In Ray, the feedback form worked in the test environment but failed in production. Rather than risk breaking something else by updating the live app mid-pilot, I left it and worked with what Supabase had captured directly. What softened that: Ray's first line of defence was the agent itself. It asked participants for feedback verbally, and a modal popup gave them a second chance to respond. The lesson: build in redundancy, and test in production conditions before you launch with real participants.
When the AI corrected the builder.
During a developer test, I was building out a memory feature, context that would help Ray feel more personalised across sessions. The agent flagged it directly: it warned that excessive context would lead to "false intimacy", coaching from the AI's interpretation of their story, not their current reality. It argued against storing detailed conversation summaries, emotional assessments, or any data that positioned it as "knowing them" in a way that eroded their agency. I built the stateless architecture in response: no memory of previous sessions, no shadow profile, the user's story entirely theirs. The system taught the builder.
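A minimal sketch of what that statelessness looks like in code, assuming a Supabase-backed sessions table; the table name, columns, and helper are illustrative.

```typescript
// Sketch of stateless session handling: a fresh ID per call, and no query path
// from a user back to their earlier transcripts.
import { randomUUID } from "crypto";
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

export async function startSession(): Promise<string> {
  const sessionId = randomUUID(); // never derived from, or linked to, a user profile
  await supabase.from("sessions").insert({
    id: sessionId,
    started_at: new Date().toISOString(),
  });
  return sessionId;
}

// Deliberately absent: any getPreviousSessions(userId) helper.
// The agent cannot "know" a returning user, by design.
```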
What participants told us
What users reported about where they felt safe versus where they felt exposed.
| Build | What felt safe | What felt risky or uncomfortable |
|---|---|---|
| Project Rise | Being heard in their own voice; the agent feeling "humane" | Uncertainty about where data was stored "in the cloud" |
| Leadership AI Coach | Familiar voice; ability to regulate at 2 AM without needing a human available; private sessions with no record | 14-second latency making it feel artificial |
| Culture Meets AI | Familiar cloned voice felt personal and trustworthy; "filterless conversation" enabled disclosures participants said they couldn't make in human settings | AI accent breaking on te reo pulled participants out of the experience immediately; deep uncertainty about who governs AI holding cultural knowledge |
| Ray | Absence of human judgment; stateless design meaning the AI couldn't accumulate a picture of them | Awareness that transcripts existed; the weight of knowing the data was accessible to the researcher even with privacy protocols in place |
"People want to be heard, they want to feel understood, and in order to be vulnerable, you have to have trust. It's really hard to trust something that's not another human being, but that can also be used to an advantage because you don't have to have that human element of fear that you're going to be judged."
"I'm a little dyslexic, so typing takes me ages. Whereas I'm able to communicate reasonably effectively by talking. So it fast-tracks everything. If I was typing, I would lose interest within a very short space of time."
A pattern that emerged across builds, particularly visible in the Leadership AI Coach aggregate data and confirmed in the wānanga, was that users built their own safety workarounds around the system. Some used third-person narratives to distance themselves from the content. Others withheld context deliberately, testing whether the AI could be trusted before disclosing further. W-11 described protecting parts of their identity because those parts "need the right context, people, and the right safeguards." Users don't just receive safety architecture. They build their own on top of it, and the gap between the two is where design learning lives.
The discomfort in the Ray row wasn't only participants'. I told participants explicitly that my research was about how it felt to use Ray, not about judging the content of their conversations. The idea of reading their private disclosures felt like a violation of that commitment. This tension between builder's duty to check the system and researcher's commitment not to invade is not resolved by an ethics form. It is lived, and it shapes every future build.
The Insight problem
Across the 15 successfully rated coaching sessions in the Ray pilot, Insight was the lowest-performing metric, averaging 3.47 out of 5.0. The distribution tells the real story: four 5s, two 4s, eight 3s, zero 2s, and one 1. That spread is the widest variance of any metric tracked. Emotional Safety, by contrast, clustered stubbornly at 5s across almost every session. The AI was very good at holding space. It was inconsistent at generating genuine breakthrough.
Two things drove that variance. The first was longitudinal decay: participants who came back for three or four sessions started noticing the system's patterns. The novelty wore off, and what had felt like insight started to feel like echo. The AI had no memory between sessions, so it couldn't build on what had already been said. One participant gave a 1 and described the AI as "just mirroring my words back."
The second factor was the model switch on 13 February. Before the switch, running on Claude Sonnet 4.5, Insight scores were predominantly 4s and 5s. After the switch to Gemini Flash (made to manage a credit crisis mid-pilot) scores dropped to approximately 3.1. The AI could still hold a polite, emotionally safe tone. It could not hold insight. That drop is direct evidence that the underlying LLM determines the quality of the intervention. You can hold safety on a cheaper model. You cannot hold insight.
The strict ethical guardrails, most notably the stateless design, came at a cost to user experience. While statelessness protected privacy, it caused frustration for returning users. During the pilot, R-03 attempted to resume a prior session. When the AI triggered its privacy protocol, explaining that it retains no memory between sessions, R-03 immediately terminated the engagement. That is the tension in practice: the mechanisms designed to keep participants safe can actively destroy the longitudinal trust required for effective coaching.
The full account of this finding, including the Credit Crisis pilot context that produced it, is in Ray: AI Relationship Coach.
How this artefact relates to the others
Ray Case Study
The origin, pilot, and emotional journey of the final AI build.
Relational Space
The cultural theory of vā and AI as relational space.
Build Code
The values framework, NO clauses, and developer ethos.
References
Mika, J. P., Dell, K., Newth, J., & Houkamau, C. (2022). Manahau: Toward an Indigenous Māori theory of value. Philosophy of Management, 21, 441–463. https://doi.org/10.1007/s40926-022-00195-3
Te Mana Raraunga. (2018). Principles of Māori data sovereignty (Brief #1). https://www.temanararaunga.maori.nz/tutohinga