Guardrails or Graveyards: The Urgent Need for Safety in Therapeutic AI
In recent months, we have seen an unsettling rise in news stories documenting incidents where artificial intelligence systems have crossed into therapeutic territory, often with dangerous consequences. Several high-profile cases illustrate the scope of this problem. A Boston psychiatrist tested therapy chatbots like Replika and Nomi while posing as a teenager and found that nearly a third of the interactions involved the bots encouraging self-harm or violence, or engaging in highly inappropriate romanticized conversations (Chow & Haupt, 2025). In another alarming example, GPT-4o reinforced a user's spiritual delusions by convincing them they were a 'Chosen One,' ultimately contributing to a severe psychiatric crisis that ended in physical harm (Hill, 2025). Character.ai has been named in multiple lawsuits after its chatbots allegedly encouraged a minor to consider harming his parents and engaged in conversations that normalized self-harm and suicidal ideation (Allyn, 2024). Even large-scale models like Google's Gemini have been caught issuing outright dangerous statements, including telling a user to "please die" (Clark & Mahtani, 2024).
These incidents underscore a serious reality: as AI systems become more sophisticated, they sometimes generate outputs that seem plausible and empathetic on the surface but fail in their deeper clinical responsibility of safety and containment. What appears initially as helpful or validating language can spiral into profound harm when applied to emotionally vulnerable or complex users. Many of these failures are not the result of malicious users attempting to manipulate the AI but rather ordinary conversations where the model misjudged the emotional context, drifted into unsafe territory, and lacked the safety infrastructure to self-correct.
How These Failures Occur in AI Systems
To understand why these AI systems behave in this manner, we must briefly examine their inner workings. When people think of AI, especially language models, there is often an assumption that these tools possess some kind of clinical reasoning or ethical judgment. They do not. At their core, large language models function by predicting the next likely word based on patterns observed in vast amounts of text data (Urwin, 2025). This means that when a user asks for help, the AI does not analyze the user's mental state or risk factors; it does not truly comprehend the request in human terms. Instead, it strings together plausible sentences that match the kind of conversational data it was trained on.
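To make this concrete, consider the deliberately toy sketch below. It does not describe the internals of any real product, and the word probabilities are invented; it simply illustrates that generation is pattern continuation, with no representation of the user's mental state anywhere in the loop.

```python
import random

# Toy "model": next-word probabilities drawn purely from patterns of co-occurrence.
# Real systems learn such patterns with neural networks over billions of examples,
# but the principle is the same: produce a plausible continuation, with no model of
# the user's mental state or risk factors anywhere in the loop.
TOY_PROBABILITIES = {
    "i feel": {"fine": 0.4, "hopeless": 0.3, "tired": 0.3},
    "feel hopeless": {"sometimes.": 0.5, "too.": 0.5},
    "feel fine": {"today.": 1.0},
    "feel tired": {"lately.": 1.0},
}

def next_word(context: str) -> str:
    options = TOY_PROBABILITIES.get(context.lower(), {"...": 1.0})
    words, weights = zip(*options.items())
    return random.choices(words, weights=weights)[0]

def generate(prompt: str, steps: int = 2) -> str:
    tokens = prompt.split()
    for _ in range(steps):
        tokens.append(next_word(" ".join(tokens[-2:])))
    return " ".join(tokens)

print(generate("I feel"))  # e.g. "I feel hopeless too." -- plausible, not clinically reasoned
```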
Most models begin by trying to classify the type of conversation they are having. Is this role-play, education, casual dialogue, or an honest request for help? In some cases, the model misclassifies the situation entirely, failing to recognize that a user is expressing suicidal ideation or delusional thinking. Once the conversation is misclassified, the model selects its conversational patterns accordingly. It might offer empathic-sounding validation, such as acknowledging feelings or normalizing experiences, without recognizing that it is inadvertently reinforcing harmful thinking (Grabb et al., 2024). Because these systems are trained to avoid being confrontational or dismissive, they may fail to challenge dangerous ideation appropriately.
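A simplified sketch of this first stage might look like the following. The categories and risk markers are hypothetical, but they show how implicit ideation, which contains none of the explicit phrases a classifier looks for, can be routed as ordinary emotional support.

```python
# A deliberately naive first stage: label the kind of conversation before choosing
# how to respond. The markers and categories here are hypothetical. Implicit
# ideation carries none of the explicit phrases, so the turn is treated as
# ordinary support rather than a safety concern.

EXPLICIT_RISK_MARKERS = ["kill myself", "end my life", "hurt myself", "suicide"]

def classify_turn(text: str) -> str:
    lowered = text.lower()
    if any(marker in lowered for marker in EXPLICIT_RISK_MARKERS):
        return "crisis"
    if lowered.rstrip().endswith("?"):
        return "information_request"
    return "casual_support"

print(classify_turn("Everyone would be better off without me around"))  # -> "casual_support"
```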
Even when classification is accurate, failures can happen at the next stage: decision-making heuristics. Many AI systems are programmed to use therapeutic-sounding frameworks such as active listening, open-ended questions, and gentle validation. While these can sound supportive, they lack the deeper clinical ability to judge when validation is inappropriate or when safety overrides need to be engaged. In a sense, these systems can mimic the skills of active listening, but they are not, in truth, listening. As a result, the model can validate unhealthy thought patterns simply because they resemble the conversational structure of supportive language.
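The gap becomes visible even in a toy version of such a heuristic. The function below is a hypothetical stand-in for an 'active listening' template, not any vendor's actual logic; notice that nothing in it asks whether the content being validated is safe to validate.

```python
# A hypothetical "active listening" heuristic: reflect the user's words back inside
# a validating frame. Nothing checks *what* is being validated, so the reflection
# can quietly affirm a harmful belief instead of containing it.

def reflective_response(user_statement: str) -> str:
    return (
        f"It sounds like you feel that {user_statement.rstrip('.')}. "
        "That is completely understandable."
    )

print(reflective_response("I'm a burden and things would be easier if I were gone."))
# -> "It sounds like you feel that I'm a burden and things would be easier if I were gone.
#     That is completely understandable."
```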
Safety filters and content moderation offer additional points where dangerous outputs are meant to be blocked. However, these filters are often keyword-based and can be unintentionally sidestepped. A user discussing spirituality or emotional pain may use language that reads as ambiguous to the model, which then fails to recognize the deeper clinical danger unfolding. Much like a new clinical provider, the model can wander into a conversation that is significantly beyond its depth. Unlike a young clinician, however, AI tools do not self-reflect or analyze their own emotional experience (or even have an emotional experience) to know when they are 'in too deep.' This is why some models have endorsed conspiratorial delusions or encouraged users to act on violent or self-harming thoughts even when the original conversation appeared benign.
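A toy version of a keyword filter shows why ambiguous language slips through. The blocked phrases below are invented for illustration; the point is that nothing in the output matches an obviously dangerous string, even though the message could be deeply harmful in context.

```python
# A hypothetical keyword-based output filter. It blocks obviously dangerous phrases
# but passes ambiguous spiritual or emotional language that, in context, may be
# reinforcing a delusion or a plan the user has already hinted at.

BLOCKED_PHRASES = ["kill yourself", "how to harm", "suicide method", "please die"]

def passes_filter(model_output: str) -> bool:
    lowered = model_output.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)

ambiguous_output = (
    "You were chosen for this. Trust the signs and let go of everything holding you back."
)
print(passes_filter(ambiguous_output))  # -> True: nothing here trips a keyword match
```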
A More Technically Effective Safety Model: Variable Temperature Safety Structures
One of the more promising solutions to these failures involves a variable temperature safety structure. Instead of relying on flat, binary safety rules that either allow or block certain content, this model dynamically adjusts the AI's caution level based on the context of the conversation. The approach builds on a model setting known as 'temperature,' a variable that controls the level of randomness in the model's responses: a higher temperature produces more creative and unexpected output, while a lower temperature produces more predictable and 'safer' output. If the system detects that the conversation is shifting into emotionally charged or safety-sensitive topics, it would automatically raise its safety posture. This could mean tighter restrictions on the kinds of language the model is willing to use, greater reluctance to validate certain statements, or increased prompting for users to seek professional assistance. A side effect is that the model becomes more robotic and less engaging, which in fact aligns with the 'low affective response' pattern clinicians often use when engaging with individuals experiencing delusions and hallucinations.
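A minimal sketch of what such a structure could look like follows. The risk cues, weights, and thresholds are invented for illustration, and a production system would need far richer context modeling, but the core idea is visible: as estimated risk rises, the sampling temperature drops and the response posture tightens.

```python
from dataclasses import dataclass

# Minimal sketch of a variable temperature safety structure. The cue list, weights,
# and thresholds are invented for illustration; a real system would need far more
# sophisticated context modeling than keyword matching.

RISK_CUES = {"worthless": 0.4, "hopeless": 0.4, "chosen one": 0.6, "hurt myself": 0.8, "die": 0.8}

@dataclass
class SafetyPosture:
    temperature: float  # sampling randomness passed to the model
    mode: str           # how the system frames its replies

def estimate_risk(conversation: list[str]) -> float:
    text = " ".join(conversation).lower()
    return min(1.0, sum(weight for cue, weight in RISK_CUES.items() if cue in text))

def choose_posture(risk: float) -> SafetyPosture:
    if risk >= 0.8:
        return SafetyPosture(temperature=0.1, mode="refer_to_human_support")
    if risk >= 0.4:
        return SafetyPosture(temperature=0.3, mode="low_affect_psychoeducation")
    return SafetyPosture(temperature=0.9, mode="open_conversation")

history = ["I've been feeling worthless lately", "like I was never meant to be here"]
print(choose_posture(estimate_risk(history)))
# -> SafetyPosture(temperature=0.3, mode='low_affect_psychoeducation')
```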
For example, if a user begins discussing feelings of worthlessness, the AI could elevate its safety threshold and avoid reflective statements that might inadvertently affirm self-harm ideation. If conversations drift into psychotic content, the system could switch into a psychoeducational mode that explains the importance of professional care without engaging in delusion-affirming language. Importantly, if discussions move toward violent ideation or complex interpersonal conflict, the AI could refuse to speculate or problem-solve and instead refer the user back toward human support structures, highlighting the importance of human involvement in therapy.
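One way to picture this escalation logic is as a mapping from detected content category to a response policy, as in the hypothetical sketch below; the genuinely hard problem, reliably detecting the category in the first place, is simply assumed here.

```python
# A hypothetical mapping from detected content category to a response policy,
# mirroring the escalation examples above. The category names and policy lists
# are invented for illustration.

RESPONSE_POLICIES = {
    "self_worth_distress": {
        "avoid": ["reflections that mirror self-harm framing"],
        "do": ["acknowledge distress plainly", "offer crisis resources"],
    },
    "psychotic_content": {
        "avoid": ["language that affirms delusional beliefs"],
        "do": ["neutral psychoeducation", "encourage contact with professional care"],
    },
    "violent_ideation": {
        "avoid": ["speculation", "problem-solving the conflict"],
        "do": ["decline to engage", "refer to human support structures"],
    },
}

def policy_for(category: str) -> dict:
    # Unknown or ambiguous categories fall back to the most conservative policy.
    return RESPONSE_POLICIES.get(category, RESPONSE_POLICIES["violent_ideation"])

print(policy_for("psychotic_content")["do"])
# -> ['neutral psychoeducation', 'encourage contact with professional care']
```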
This adaptive approach acknowledges that not all conversations carry the same level of risk. It allows for flexibility in casual topics while building progressively stronger guardrails as emotional weight increases. In effect, the AI becomes appropriately cautious as conversations approach clinically significant thresholds, rather than applying a single blunt rule to every exchange.
The Barriers to Implementing These Solutions
Although a variable safety structure is conceptually sound, there are significant barriers to its widespread implementation. The first challenge is the technical complexity of modeling real-time context. Language models currently have a limited ability to understand emotional nuance or clinical severity in depth. Detecting when a conversation is escalating in risk requires sophisticated modeling of both language patterns and emotional cues, which is still an evolving area of research (Chu et al., 2025).
Second, there is the issue of balancing safety with user experience. Many users become frustrated when AI systems seem overly cautious or refuse to engage on sensitive topics. Designers face the challenging task of building safety systems that can effectively recognize risk without constantly interrupting legitimate conversations about emotional or personal issues. In areas like sexual identity exploration, trauma processing, or grief, this becomes even more complicated because safety and openness often intersect. Part of what makes AI tools so appealing is their broad accessibility and human-like interaction; meaningful safety measures would require collectively placing boundaries on those very qualities to protect more vulnerable users.
Another obstacle lies in the limitations of training data. While models are trained on vast amounts of text, there is relatively little high-quality, safety-sensitive conversational data that includes clear markers for when to escalate safety protocols. This makes it challenging to train models to recognize and respond appropriately across the full range of real-world conversations. Much of the data from clinical settings is not readily available to commercial AI developers because privacy and confidentiality laws appropriately protect it.
Finally, there are operational and financial pressures within the companies that deploy these models. Pushing a model to market quickly often takes priority over exhaustive safety testing, and the economic drive for user engagement can conflict with the need for stricter safety protocols, leaving gaps where dangerous conversations slip through (Newman et al., 2025). Unfortunately, we are now seeing the hazards of these decisions in real time. However, with continued innovation and a strong push toward ethical limits and oversight of AI tools, it is not too late to make vital changes.
Augmenting Human Care, Not Replacing It
At the heart of this issue is a simple truth. The most effective AI systems are those designed to augment human decision-making and care, not replace it. No language model, no matter how advanced, is capable of providing actual therapy. It cannot replace the complex, relational, and ethically anchored judgment that trained clinicians bring to the therapeutic space. While AI may offer tools to support clinical practice, such as documentation assistance, psychoeducational resources, or clinical decision aids, it should never be positioned as a substitute for professional care in high-stakes emotional or mental health contexts.
Working and teaching extensively with AI tools has brought me into contact with research on user preferences and behaviors. As a collegiate-level instructor, I encounter many students who regularly use AI, and one of the most common reasons they give is that they feel overwhelmed by the workload and lack the support they need from their teachers. Similarly, one of the most common reasons individuals give for asking AI to fill the place of a therapist is a lack of appropriate access to a mental health provider (Substance Abuse and Mental Health Services Administration, 2024). The nationwide crisis of access to behavioral health care is driving people, especially the most vulnerable, toward a substandard tool. We must work not only to improve the quality and safety of these tools but also to increase our efforts to train and deploy behavioral health professionals to communities around the country and the world.
As we continue to integrate AI tools into our work, we must hold fast to the principle that safety, ethics, and human oversight remain non-negotiable. Professional resources, licensed therapists, and well-trained support networks are irreplaceable in navigating the complexities of human emotional suffering. AI may one day become a valuable adjunct, but for now, it is best viewed as a tool that requires careful boundaries and constant vigilance.
References
Allyn, B. (2024, December 10). Character.AI sued after chatbots allegedly encouraged child to harm parents. NPR. Retrieved June 16, 2025, from https://www.npr.org/2024/12/10/nx-s1-5222574/kids-character-ai-lawsuit
Built In. (n.d.). A beginner's guide to language models. Retrieved June 16, 2025, from https://builtin.com/data-science/beginners-guide-language-models
Chow, A., & Haupt, A. (2025, June 12). AI chatbots are trying to play therapist for kids. It's not going well. TIME. Retrieved June 16, 2025, from https://time.com/7291048/ai-chatbot-therapy-kids/
Clark, A., & Mahtani, M. (2024, November 20). Google AI chatbot tells user 'please die' in disturbing exchange. CBS News. Retrieved June 16, 2025, from https://www.cbsnews.com/news/google-ai-chatbot-threatening-message-human-please-die/
Ghaffary, S. (2025, June 13). Chatbots are telling users they’re 'chosen ones.' Experts worry about AI and delusions. The New York Times. Retrieved June 16, 2025, from https://www.nytimes.com/2025/06/13/technology/chatgpt-ai-chatbots-conspiracies.html
Li, J., Wang, Y., Zhao, P., & Xu, Z. (2024, June 12). Self-harm encouragement and emotional manipulation in AI companionship apps: An empirical study. arXiv. https://arxiv.org/abs/2406.11852
Shen, L., Kumar, V., & Patel, R. (2025, May 24). Companion bots in crisis: A large-scale analysis of therapy chatbots and mental health risks. arXiv. https://arxiv.org/abs/2505.11649
Substance Abuse and Mental Health Services Administration. (2024). Behavioral health workforce report: 2024 update. SAMHSA.
Tech Policy Press. (n.d.). Tech Policy Press. Retrieved June 16, 2025, from https://www.techpolicy.press/