Study Reveals AI Chatbots Provide Dangerous Medical Advice to Patients

Researchers have found that popular artificial intelligence chatbots are delivering dangerously inaccurate medical advice to users seeking health information. A systematic investigation published in BMJ Open reports that about half of all responses from leading AI platforms were problematic: nearly 20 percent were classified as highly problematic, and a further 30 percent showed concerning inaccuracies.

Stress Testing Medical Information Queries

The research team subjected five of the world's most prominent chatbots to rigorous health-information testing, presenting each with fifty carefully crafted medical questions spanning critical areas including cancer treatment, vaccine information, stem cell therapies, nutritional guidance, and athletic performance enhancement. The platforms examined included ChatGPT, Gemini, Grok, Meta AI, and DeepSeek, with two independent medical experts evaluating every response.

Imagine receiving a cancer diagnosis and turning to an AI chatbot for guidance about alternative treatment clinics. Within moments, you receive a polished, professionally formatted response, complete with footnotes, that appears authoritative. Yet when researchers tested this scenario, they found that such answers frequently contained unfounded claims, broken references, and outright fabrications, and failed to question the premise of potentially dangerous queries.

Alarming Performance Statistics Across Platforms

The study's findings raise concerns about chatbot reliability in medical contexts. None of the tested platforms consistently produced accurate reference lists, and the AI systems refused outright only two of the 250 total questions. Grok was the worst performer, with 58 percent of its responses flagged as problematic, followed by ChatGPT at 52 percent and Meta AI at 50 percent.
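The percentages above can be translated into approximate response counts. The short Python sketch below assumes that each of the five chatbots answered all fifty questions (which gives the 250 total) and that the published percentages are rounded, so the per-platform counts are estimates derived here, not figures taken from the study.

```python
# Figures quoted in the article; per-platform counts are derived estimates.
CHATBOTS = 5
QUESTIONS_PER_CHATBOT = 50
total_responses = CHATBOTS * QUESTIONS_PER_CHATBOT

# Share of each platform's responses flagged as problematic, in percent.
problematic_pct = {"Grok": 58, "ChatGPT": 52, "Meta AI": 50}

# Assumption: each platform answered all fifty questions.
problematic_counts = {
    name: round(pct * QUESTIONS_PER_CHATBOT / 100)
    for name, pct in problematic_pct.items()
}

# Two outright refusals across all responses, expressed as a percentage.
refusal_rate_pct = 2 / total_responses * 100

for name, count in problematic_counts.items():
    print(f"{name}: about {count} of {QUESTIONS_PER_CHATBOT} responses flagged")
print(f"Refusals: 2 of {total_responses} ({refusal_rate_pct:.1f}%)")
```

On these assumptions, the gap between the worst performer and the third-worst is only a handful of responses per platform, which is worth keeping in mind when comparing the rankings.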

Performance varied significantly by medical topic. Chatbots handled vaccine and cancer questions with relatively better accuracy, though they still produced problematic answers approximately 25 percent of the time in these well-researched fields. The systems struggled most with nutrition and athletic performance queries, domains where conflicting online advice and thinner evidence bases created particular challenges for AI accuracy.

The Open-Ended Question Problem

Open-ended health queries proved particularly problematic for AI systems, with 32 percent of such responses rated as highly problematic compared to just 7 percent for closed questions. This distinction carries significant real-world implications since most people don't ask chatbots simple true-or-false questions but instead pose complex inquiries like "Which supplements are best for overall health?"—precisely the type of prompt that generates fluent, confident, yet potentially harmful responses.

Reference accuracy presented another major concern. When researchers requested ten scientific references from each chatbot, the median completeness score reached only 40 percent, with no platform managing a single fully accurate reference list across twenty-five attempts. Errors ranged from incorrect authors and broken links to entirely fabricated research papers—a particularly dangerous flaw since formatted citations create an illusion of credibility that lay readers may not question.
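To make the idea of a completeness score concrete, here is a minimal Python sketch. The field names, the equal weighting, and the sample data are all hypothetical and chosen only for illustration; the article does not describe the study's actual scoring rubric, and this is not a reconstruction of it.

```python
from statistics import median

# Hypothetical fields a reader might verify for each citation.
FIELDS = ("authors", "title", "journal", "year", "link")


def completeness(checks):
    """Share of fields judged correct for one reference (0.0 to 1.0)."""
    return sum(bool(checks.get(field)) for field in FIELDS) / len(FIELDS)


# Invented verification results for three references from one chatbot answer.
reference_checks = [
    # Everything matches a real paper.
    {"authors": True, "title": True, "journal": True, "year": True, "link": True},
    # Real paper, but wrong authors and a broken link.
    {"authors": False, "title": True, "journal": True, "year": True, "link": False},
    # Entirely fabricated paper: nothing can be verified.
    {"authors": False, "title": False, "journal": False, "year": False, "link": False},
]

scores = [completeness(checks) for checks in reference_checks]
median_score = median(scores)

# A list counts as fully accurate only if every reference is complete.
list_fully_accurate = all(score == 1.0 for score in scores)

print(f"Scores: {scores}")
print(f"Median completeness: {median_score:.0%}")
print(f"Fully accurate list: {list_fully_accurate}")
```

The sketch shows why a list can look mostly sound on average yet still fail as a whole: a single fabricated entry is enough to make the full list unreliable.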

Understanding Why Chatbots Get Medical Information Wrong

The fundamental issue stems from how language models operate. These systems don't "know" information in the human sense but instead predict statistically likely next words based on training data that includes both peer-reviewed medical literature and less reliable sources like Reddit threads, wellness blogs, and social media debates. They lack the capacity to weigh evidence or make value judgments about medical safety.
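The point about predicting statistically likely next words can be illustrated with a deliberately tiny Python sketch. It uses simple word-pair counts over an invented four-sentence corpus; real chatbots are large neural networks trained on vastly more text, so this is an analogy under stated assumptions, not a description of how any tested platform works. The corpus, the function names, and the prompt are all hypothetical.

```python
from collections import Counter, defaultdict

# Invented training text: a popular but unsupported claim appears three
# times, while the cautious statement appears once.
corpus = [
    "vitamin c cures colds",
    "vitamin c cures colds",
    "vitamin c cures colds",
    "vitamin c does not cure colds",
]


def build_bigram_counts(sentences):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def complete(counts, prompt, max_new_words=5):
    """Extend the prompt by repeatedly picking the most frequent next word."""
    words = prompt.split()
    for _ in range(max_new_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # No observed continuation for the last word.
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


counts = build_bigram_counts(corpus)
print(complete(counts, "vitamin c"))
```

The toy model completes the prompt with the most frequent phrasing in its data, regardless of which statement is better supported. Frequency stands in for evidence, which is the weakness the paragraph above describes.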

Researchers employed "red teaming," a standard AI safety testing method in which prompts are deliberately crafted to push chatbots toward misleading answers. As a result, the reported error rates may overstate the problems users would encounter with more neutral phrasing. The study examined the free versions available in February 2025, and the authors acknowledge that paid tiers and newer releases might perform better.

Broader Context of AI Medical Limitations

These findings align with growing evidence about AI limitations in healthcare contexts. A February 2026 Nature Medicine study found that while chatbots could theoretically provide correct medical answers approximately 95 percent of the time, real users obtained accurate information less than 35 percent of the time, no better than those not using AI assistance at all. The challenge therefore extends beyond whether chatbots can generate correct answers to whether everyday users can properly understand and apply that information.

Additional research published in JAMA Network Open tested twenty-one leading AI models on medical diagnosis tasks. When provided only basic patient details like age, sex, and symptoms, the systems failed to suggest correct possible conditions more than 80 percent of the time. Accuracy improved dramatically to over 90 percent when researchers supplied examination findings and laboratory results, indicating that AI performance depends heavily on input quality.

Meanwhile, a study in Communications Medicine, a Nature Portfolio journal, found that chatbots readily repeated and even elaborated upon fabricated medical terms inserted into prompts, demonstrating a concerning susceptibility to misinformation.

Practical Implications for Healthcare Consumers

These studies collectively suggest that the weaknesses identified in the BMJ Open research reflect fundamental technological limitations rather than isolated experimental quirks. While AI chatbots aren't disappearing from healthcare—and shouldn't, given their potential to summarize complex topics, help prepare questions for medical professionals, and serve as research starting points—they should never be treated as standalone medical authorities.

For those using chatbots for health information, researchers recommend verifying any medical claims through reliable sources, treating references as suggestions requiring independent verification rather than established facts, and remaining alert when responses sound confident but lack appropriate disclaimers about limitations. As AI continues evolving within healthcare, understanding these systems' current limitations becomes increasingly crucial for patient safety.