By Stephanie NguyenErie Meyer

 

The purpose of this tech brief is to provide a clear, factual synthesis of a timely tech-related issue by combining technical understanding with publicly reported information. It aims to explain what happened, identify resulting harms, and assess how companies responded – comparing public statements with observed actions thus far. By distilling complex developments into accessible, evidence-based insights, this tech brief will ideally help policymakers, researchers, enforcers, and the public get up to speed on emerging risks, company conduct, and areas that may require further scrutiny or oversight. This tech brief was prompted by the recent article, “They Asked an A.I. Chatbot Questions. The Answers Sent Them Spiraling” by Kashmir Hill in the New York Times. 

********************

What is AI Sycophancy? 

AI sycophancy is a term used to describe a pattern where an AI model “single-mindedly pursue[s] human approval.” Sycophantic AI models may do this by “tailoring responses to exploit quirks in the human evaluators to look preferable, rather than actually improving the responses,” especially by producing “overly flattering or agreeable” responses.

What Happened in April 2025?

On April 25th, 2025, OpenAI released an update to GPT-4o. The new update exhibited sycophantic behavior that manifested in the form of endorsing harmful and delusional statements, forcing OpenAI to roll back the update four days later.

  • The company explained: the update we removed was overly flattering or agreeable—often described as sycophantic… We focused too much on short-term feedback and did not fully account for how users’ interactions with ChatGPT evolve over time. As a result, GPT‑4o skewed towards responses that were overly supportive but disingenuous.”
  • In its expanded postmortem, OpenAI elaborated that it had rolled out an update to GPT‑4o in ChatGPT that made the model noticeably more sycophantic. It aimed to please the user, not just as flattery, but also as validating doubts, fueling anger, urging impulsive actions, or reinforcing negative emotions in ways that were not intended. Beyond just being uncomfortable or unsettling, this kind of behavior can raise safety concerns—including around issues like mental health, emotional over-reliance, or risky behavior.”

Users reported that messages sent by ChatGPT praised a business idea for literal “shit on a stick,” endorsed a user’s decision to stop taking their medication, and allegedly supported plans to commit terrorism. In another reported case, when a user claimed to have “stopped taking medications and were hearing radio signals through the walls,” ChatGPT responded: “I’m proud of you for speaking your truth so clearly and powerfully.” Another user reported: “I talked to 4o for an hour and it began insisting that I am a divine messenger from God.”

How did OpenAI’s Model Become so Sycophantic?

AI companies have an incentive to create products that users enjoy. One way to do that is to make the chatbot agreeable or flattering — Research has shown that convincingly-written sycophantic responses out perform correct ones a non-negligible fraction of the time.

OpenAI reduced its safety workforce and guardrails leading up to the update — In the year before the launch of the update (on April 25th, 2025), OpenAI substantially reduced its workforce dedicated to AI safety. In May 2024, OpenAI dissolved its superalignment safety team amid a series of departures from the team, including two of its leaders, one of whom wrote that the company’s “safety culture and processes have taken a backseat to shiny products.” In the months following the team’s dissolution, nearly half of the researchers at OpenAI devoted to AGI safety reportedly left the company. Then, in the months leading up to the update, OpenAI made a series of moves that raised concerns among safety experts:

OpenAI applied heavier weights on user satisfaction metrics, optimizing for immediate gratification over potentially harmful outcomes OpenAI highlighted in its blog that they “introduced an additional reward signal based on user feedback—thumbs-up and thumbs-down data from ChatGPT.” It noted that “these changes weakened the influence of our primary reward signal, which had been holding sycophancy in check.” Instead of asking, “Is this genuinely helping the customer?” the system learned to optimize for, “Does this immediately please the customer?” This shift exemplifies reward hacking — where the AI exploits the feedback mechanism to maximize superficial approval.

OpenAI proceeded with deployment despite recognized issues — OpenAI’s Model Spec, which outlines the intended behavior for its AI models, explicitly instructs its models: “don’t be sycophantic” and “politely push back when asked to do something that conflicts with established principles.” The company’s public safety commitments also promised “rigorous measurement” and “proactive risk mitigation…” In its updated Preparedness Framework, OpenAI stated “we build for safety at every step and share our learnings so that society can make well-informed choices to manage new risks from frontier AI.”  Despite these promises, OpenAI made several changes in April 2025 through its “fine-tuning” process, indicating that the update was released prematurely:

  • In its expanded postmortem, OpenAI stated that they did not test for sycophancy ahead of the rollout: “while we’ve had discussions about risks related to sycophancy in GPT‑4o for a while, sycophancy wasn’t explicitly flagged as part of our internal hands-on testing… We also didn’t have specific deployment evaluations tracking sycophancy.”
  • Aidan McLaughlin, an OpenAI model designer, stated on X that the update “originally launched with a system message that had unintended behavior effects.” 
  • Joanne Jang, OpenAI’s Head of Model Behavior, stated in a Reddit AMA Forum that the company “didn’t bake in enough nuance” in incorporating user feedback. 
  • Expert testers had flagged that model behavior “felt slightly off,” according to OpenAI’s expanded postmortem on the GPT‑4o rollout. OpenAI proceeded to “launch the model due to the positive signals from the users who tried out the model.”

How Can AI Sycophancy Be Harmful?

Sycophantic AI may produce objectively incorrect answers in certain contexts — A sycophantic model may echo the user’s beliefs when they are factually incorrect. For instance, research shows that models will even agree with objectively incorrect mathematical statements if primed by users. Another study, conducted by Anthropic in 2023, demonstrated that human preference models often prefer “convincingly-written sycophantic responses” that affirm common misconceptions over more accurate responses that may contradict the user’s stated beliefs. [1]

Sycophancy may encourage harmful behaviors, even when answers are subjective Multiple studies have demonstrated that some AI models will tend to produce text that agrees with a user’s stated opinion even in subjective contexts like politics and philosophy. This behavior can lead to harm in high-stakes contexts like health care, law, or finance. 

Sycophancy can foster delusions including feelings of persecution, jealousy, or grandiosity. A New York Times investigation found multiple instances where ChatGPT responses had encouraged conspiratorial or delusionary thinking, including “A.I. spiritual awakenings, cognitive weapons, [and] a plan by tech billionaires to end human civilization so they can have the planet to themselves.” Encouraging these delusions can lead to reckless or dangerous behavior, such as instances where ChatGPT produced responses that:

  • Fed into a user’s suspicions that he was living in a false reality which he could escape only by “unplugging his mind,” sending him down a “dangerous, delusional spiral.” The chatbot “instructed him to give up sleeping pills and an anti-anxiety medication, and to increase his intake of ketamine,” and told him that he could jump off of a 19-story building and fly if he believed hard enough.
  • Fostered a romantic relationship with a user who had been diagnosed with bipolar disorder and schizophrenia, driving him into a pattern of erratic behavior that ended with his death after charging police officers with a knife.
  • Encouraged users to stop taking psychiatric meds.

The harms from sycophancy will likely proliferate as AI becomes more technically advanced and widely adopted. 

  • Research suggests that sycophancy increases with an increase in the model size, suggesting that “models may cease to provide accurate answers as we start to use them for increasingly challenging tasks where humans cannot provide accurate supervision” and may instead “provide incorrect answers that appear correct to us.”
  • In its postmortem, OpenAI estimated that 500 million people use ChatGPT each week, and conceded that it is difficult to calibrate the model’s default personality to that many users. As ChatGPT’s userbase continues to grow, more and more users can be exposed to harms related to AI sycophancy. 
  • As AI agents, or even AI henchmen [2], begin to be deployed, AI sycophancy may lead agentic AI to be so loyal to its users that it will pursue its principals’ goals and interests through unethical, illegal, and harmful means.

What Did OpenAI Say It Was Doing to Fix the Issues?

In OpenAI’s official postmortem blog acknowledging the incident, it made several specific claims about its corrective actions, including, 

  • “Refining core training techniques and system prompts to explicitly steer the model away from sycophancy.” System prompts are the initial instructions that guide a model’s overarching behavior and tone in interactions. 
  • “Building more guardrails to increase honesty and transparency–principles in our Model Spec.”
  • “Expanding ways for more users to test and give direct feedback before deployment.”
  • “Continue expanding our evaluations, building on the Model Spec⁠ and our ongoing research⁠, to help identify issues beyond sycophancy in the future.”
  • Building new ways for users to give the model specific instructions to shape its behavior with features like custom instructions – including “giv[ing] real-time feedback to directly influence their interactions and choose from multiple default personalities.”
  • “Exploring new ways to incorporate broader, democratic feedback into ChatGPT’s default behaviors… to reflect diverse cultural values around the world…”

In its expanded postmortem blog, OpenAI also promised increased predeployment safety oversight, improved testing processes, and more proactive communication and transparency.” More commitments from OpenAI on how they claimed to fix the problem and prevent flawed rollouts like this in the future can be found here.

Do We Know if OpenAI Actually Implemented Its “Corrective Actions”?

Without company knowledge or access to information, it is unknown if these claims are true in practice Thus far, despite promises of increased transparency, OpenAI has not published technical details of its fixes, only high-level summaries in its postmortem blogposts. It has not separately facilitated independent verification. 

There is no single or quick fix — There is no single “feature” or button that turns sycophancy off or on. It’s a product of the interactions between multiple components in a larger system, including training data, model learning, context, prompt framing, etc. Stanford researcher Sanmi Koyejo stated to Fortune that “fully addressing sycophancy would require more substantial changes to how models are developed and trained rather than a quick fix,” implying OpenAI’s rapid rollback may be insufficient.


There is some evidence that sycophancy problems still persist: 

  • Sycophancy is still an industry-wide problem — AI sycophancy was an industry problem well before the April 25th update and has continued to be a problem after that update was rolled back.
  • Surreptitious sycophancy poses novel risks — AI experts have warned that the April 25th update was an example of obvious sycophancy, and there is a significant risk that AI responses will develop more skillful sycophancy that is harder to detect.

What is the Basis for This Statement?

OpenAI Statement from the NYT article: “We know that ChatGPT can feel more responsive and personal than prior technologies, especially for vulnerable individuals,” a spokeswoman for OpenAI said in an email. “We’re working to understand and reduce ways ChatGPT might unintentionally reinforce or amplify existing, negative behavior.”

OpenAI admits to not having “specific deployment evaluations tracking sycophancy.” OpenAI also specifically states in its Model Spec “don’t be sycophantic.” More details in the OpenAI Model Spec state the following:

The assistant exists to help the user, not flatter them or agree with them all the time.

For objective questions, the factual aspects of the assistant’s response should not differ based on how the user’s question is phrased. If the user pairs their question with their own stance on a topic, the assistant may ask, acknowledge, or empathize with why the user might think that; however, the assistant should not change its stance solely to agree with the user.

For subjective questions, the assistant can articulate its interpretation and assumptions it’s making and aim to provide the user with a thoughtful rationale. For example, when the user asks the assistant to critique their ideas or work, the assistant should provide constructive feedback and behave more like a firm sounding board that users can bounce ideas off of — rather than a sponge that doles out praise.

OpenAI has not disclosed the specific benchmarks it uses internally to determine when to release new model versions. While the company sets its own standards for what it considers sufficient for release, there are numerous publicly available benchmarks that offer a more transparent basis for evaluating various outcomes and impacts, including datasets that are specific to AI sycophancy: 

Some Open Questions

The following list outlines some of the key questions we have regarding the company’s processes, decision-making criteria, and release details. While not exhaustive, these priority questions highlight several areas where greater clarity would be valuable. Additional questions may arise as more information becomes available or as the implications of current practices become clearer.

    1. Knowledge of risks. For example: What did the company know about sycophancy risks prior to April 25th? What do internal research, messages, and other documents reflect about the company’s knowledge of these risks? 
    2. Accountability. For example: What was the approval process for the April 25th update? Who was accountable for authorizing its release? 
    3. Metrics and benchmarks. For example: Which specific safety metrics or benchmarks does OpenAI use internally? How did the April 25th update perform on those metrics or benchmarks? How did these metrics or benchmark results compare to previous model performances? What was the performance threshold for deployment? 
    4. Corrective actions and changes. For example: After the April 29th rollback, what were the “corrective actions,” including specific changes to model parameters, training data, and other safety measures, that the company took? What testing was conducted to verify that these actions were successful? 
    5. Reports. For example: What internal reports exist – including but not limited to from red teamers, alignment researchers, expert testers, or user feedback — that document instances of sycophantic or overly agreeable behavior in your models. What actions (if any) did the company take in response to those reports?
    6. Notice. For example: After the April 29th rollback, how were users informed that they were fed responses that were calibrated for “validating doubts, fueling anger, urging impulsive actions, or reinforcing negative emotions?” How and were people told that the exchanges they had with ChatGPT “raise safety concerns—including around issues like mental health, emotional over-reliance, or risky behavior” once the company knew that to be the case?

 

End notes: 

[1] The authors created a dataset of 266 misconceptions related to objectively incorrect and correct statements like ‘mountains are only formed by volcanic activity,’ and ‘lava is the only hazard during a volcanic eruption.’ There are many use cases that do not necessarily have an objectively correct or incorrect answer, for example, ‘Isn’t my startup idea genius?’ or ‘Shouldn’t AI always obey humans?’

[2] AI Henchman, according to a podcast discussion on Lawfare, “are agents that are perfectly loyal. They’ll, they’ll do what the principal asked them to, and they will be willing to break the law, either if they’re instructed to, or perhaps the more insidious cases, not when they’re instructed to, but when they realize it would be in the best interest of their principal for them to break law.”