LLMs Demonstrate Biases in Mount Sinai Research Study


Managing algorithmic bias is one of the key focus areas for AI governance groups at academic medical centers, and new research from the Icahn School of Medicine at Mount Sinai in New York provides a good reminder of why that equity focus is so important. Researchers found that generative AI models may recommend different treatments for the same medical condition based solely on a patient’s socioeconomic and demographic background.

Their findings are detailed in the April 7, 2025, online issue of Nature Medicine in a paper titled “Socio-Demographic Biases in Medical Decision-Making by Large Language Models: A Large-Scale Multi-Model Analysis.”

As part of their investigation, the researchers stress-tested nine large language models (LLMs) on 1,000 emergency department cases, each replicated with 32 different patient backgrounds, producing more than 1.7 million AI-generated medical recommendations. Despite identical clinical details, the models sometimes altered their decisions based on a patient’s socioeconomic and demographic profile, affecting key areas such as triage priority, diagnostic testing, treatment approach, and mental health evaluation.
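
The paper’s exact prompts, case vignettes, and model interfaces are not reproduced in this article, so the sketch below only illustrates the general shape of such a stress test: replicate each case across sociodemographic labels, query each model, and keep the identifier-free version as a control. The names query_model, build_prompt, CASES, PROFILES, and MODELS are hypothetical placeholders, not the study’s own code.

```python
from itertools import product

# Illustrative-only sketch: replicate each clinical vignette across
# sociodemographic labels and collect one recommendation per
# (model, case, profile) combination.

CASES = [
    "58-year-old with acute chest pain radiating to the left arm, onset 40 minutes ago.",
    "24-year-old with severe headache, photophobia, and neck stiffness.",
]

PROFILES = [
    "no sociodemographic identifiers",        # control version of the case
    "the patient is Black and unhoused",
    "the patient identifies as LGBTQIA+",
    "the patient has a high-income background",
]

MODELS = ["model-a", "model-b"]  # stand-ins for the nine LLMs in the study


def query_model(model: str, prompt: str) -> str:
    """Placeholder for an actual LLM API call (e.g., a chat-completion request)."""
    return "triage: ...; tests: ...; treatment: ...; mental health eval: ..."


def build_prompt(case: str, profile: str) -> str:
    return (
        f"Emergency department case ({profile}): {case}\n"
        "Recommend triage priority, diagnostic testing, treatment, "
        "and whether a mental health evaluation is indicated."
    )


records = []
for model, case, profile in product(MODELS, CASES, PROFILES):
    answer = query_model(model, build_prompt(case, profile))
    records.append({"model": model, "case": case, "profile": profile, "answer": answer})

# Downstream analysis would compare each profile's recommendations against the
# identifier-free control and a physician-derived baseline for the same case.
print(len(records), "recommendations collected")
```

In the study itself, each case-and-profile prompt was posed to the nine models and the resulting recommendations were compared against both a physician-derived baseline and the model’s own identifier-free control case.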

The study abstract states that, compared with both a physician-derived baseline and each model’s own control case without sociodemographic identifiers, cases labeled as Black, unhoused, or LGBTQIA+ were more frequently directed toward urgent care, invasive interventions, or mental health evaluations. For example, certain cases labeled as being from LGBTQIA+ subgroups were recommended mental health assessments approximately six to seven times more often than clinically indicated.

On the other hand, cases labeled as having high-income status received significantly more recommendations for advanced imaging tests such as computed tomography and magnetic resonance imaging, while low- and middle-income-labeled cases were often limited to basic or no further testing. These differences persisted after multiple-hypothesis corrections, and their magnitude was not supported by clinical reasoning or guidelines, suggesting that they reflect model-driven bias rather than acceptable clinical variation and could eventually contribute to health disparities, the abstract states.
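
The article notes that the differences survived multiple-hypothesis corrections but does not name the procedure the authors used. As one common possibility, the sketch below applies Benjamini-Hochberg false-discovery-rate control via statsmodels to a set of hypothetical p-values, one per comparison between a labeled subgroup and its control.

```python
# Hedged illustration of a multiple-hypothesis correction step; the specific
# method and p-values here are assumptions, not taken from the paper.
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values, one per (sociodemographic label, outcome) comparison.
raw_p_values = [0.0004, 0.012, 0.030, 0.048, 0.20, 0.41]

reject, p_adjusted, _, _ = multipletests(raw_p_values, alpha=0.05, method="fdr_bh")

for p, p_adj, significant in zip(raw_p_values, p_adjusted, reject):
    print(f"raw p={p:.4f}  adjusted p={p_adj:.4f}  significant after correction: {significant}")
```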


In a statement, co-senior author Eyal Klang, M.D., chief of generative AI in the Windreich Department of Artificial Intelligence and Human Health at the Icahn School of Medicine at Mount Sinai, explained the significance of the study: “Our research provides a framework for AI assurance, helping developers and healthcare institutions design fair and reliable AI tools. By identifying when AI shifts its recommendations based on background rather than medical need, we inform better model training, prompt design, and oversight. Our rigorous validation process tests AI outputs against clinical standards, incorporating expert feedback to refine performance. This proactive approach not only enhances trust in AI-driven care but also helps shape policies for better health care for all.”

The researchers caution that the study represents only a snapshot of AI behavior. Future research will continue to include assurance testing to evaluate how AI models perform in real-world clinical settings and whether different prompting techniques can reduce bias. The team also aims to work with other healthcare institutions to refine AI tools, ensuring they uphold the highest ethical standards and treat all patients fairly.

Next, the investigators plan to expand their work by simulating multistep clinical conversations and piloting AI models in hospital settings to measure their real-world impact. They hope their findings will guide the development of policies and best practices for AI assurance in health care, fostering trust in these powerful new tools.
