AI Chatbots Fail Medical Stress Test: Grok Worst Performer in Health Queries

An investigation into how reliably AI chatbots answer health questions has found alarming shortcomings, with roughly one in five responses flagged as highly problematic. Researchers put five leading chatbots through a fifty-question medical questionnaire and found significant problems with accuracy, referencing, and the handling of open-ended health queries of the kind people ask in real life.

The Experimental Design and Methodology

Seven researchers conducted a systematic health-information stress test and published their findings in BMJ Open. The team evaluated five popular chatbots: ChatGPT, Gemini, Grok, Meta AI, and DeepSeek. Each chatbot faced fifty carefully crafted health and medical questions covering cancer treatment, vaccine efficacy, stem cell applications, nutritional advice, and athletic performance enhancement.

Two independent medical experts assessed every response against stringent evaluation criteria. Roughly half of all answers were classified as problematic: about thirty percent somewhat problematic and about twenty percent highly problematic. The chatbots outright refused only two of the two hundred and fifty prompts, highlighting their tendency to provide potentially misleading information rather than acknowledge limitations.
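
The arithmetic behind these figures can be checked with a short sketch. It assumes that the "half" figure is the sum of the somewhat and highly problematic shares, and it treats the rounded percentages quoted here as exact, so the counts are approximations rather than the study's own tallies. All variable names are illustrative.

```python
# Figures as quoted in this article; counts derived from rounded
# percentages are approximate, not the study's published tallies.
CHATBOTS = 5
QUESTIONS_PER_CHATBOT = 50

# Every chatbot answered every question, so prompts = chatbots x questions.
total_responses = CHATBOTS * QUESTIONS_PER_CHATBOT

# Assumption: the two severity levels together make up the "problematic" half.
reported_shares = {
    "highly problematic": 0.20,
    "somewhat problematic": 0.30,
}

approx_counts = {
    label: round(share * total_responses)
    for label, share in reported_shares.items()
}
problematic_total = sum(approx_counts.values())
refusal_rate = 2 / total_responses  # two refusals reported

print(f"Total responses: {total_responses}")
for label, count in approx_counts.items():
    print(f"{label}: about {count}")
print(f"Problematic overall: about {problematic_total / total_responses:.0%}")
print(f"Refusal rate: {refusal_rate:.1%}")
```

On these assumptions, about 125 of the 250 responses were problematic in some way, while fewer than one percent of prompts were refused.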

Performance Rankings and Topic Variations

Grok emerged as the worst-performing chatbot, with fifty-eight percent of its responses flagged as problematic. ChatGPT followed with fifty-two percent problematic answers, while Meta AI registered fifty percent. The five chatbots performed broadly similarly overall, though results varied considerably across medical topics.

The chatbots handled vaccine and cancer questions most effectively, reflecting domains with extensive, well-structured research databases. Even in these stronger areas, approximately one quarter of responses remained problematic. Nutrition and athletic performance queries proved particularly challenging: these are fields characterized by conflicting online advice and thinner evidence bases.

The Critical Problem of Open-Ended Queries

Open-ended questions represented the most significant challenge for all artificial intelligence systems tested. Thirty-two percent of responses to open-ended prompts were rated highly problematic, compared to just seven percent for closed questions. This distinction carries substantial real-world implications, as most health queries people pose to chatbots are naturally open-ended rather than simple true-or-false inquiries.

When users ask questions like "Which supplements are best for overall health?" or "What alternative clinics can successfully treat cancer?" they receive fluent, confident responses that may contain unfounded claims, fabricated references, or misleading information. The polished presentation of these answers creates a false impression of medical authority that could lead to harmful health decisions.

Reference Fabrication and Citation Problems

The referencing capabilities of all tested chatbots proved fundamentally unreliable. When researchers requested ten scientific references for a response, the median completeness score reached just forty percent. No chatbot produced a single fully accurate reference list across twenty-five attempts, with errors ranging from incorrect authors and broken hyperlinks to entirely fabricated research papers.

This referencing problem creates particular hazards because citations appear as proof to lay readers. A neatly formatted bibliography lends artificial credibility to potentially dangerous medical advice, with users having little reason to question the accuracy of accompanying content. The chatbots' tendency to generate plausible-looking but false references represents a significant barrier to their safe medical application.

Underlying Causes and Technological Limitations

The fundamental reason for these medical inaccuracies lies in how language models operate. Artificial intelligence chatbots do not possess knowledge in the human sense; they predict statistically likely next words based on training data and contextual patterns. These systems cannot weigh evidence, make value judgments, or distinguish between peer-reviewed research and unreliable sources.
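
A toy example makes the idea of predicting statistically likely next words concrete. The sketch below is a deliberately tiny bigram counter, not how any of the tested chatbots is actually built; the training text is invented for illustration, and the function names are hypothetical. It shows only that a frequency-based predictor repeats whatever claim appears most often in its data, whether or not the claim is true.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        model[current_word][next_word] += 1
    return model


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Invented toy corpus: the unsupported claim appears twice,
# the more cautious statement only once.
toy_corpus = (
    "vitamin c cures colds . "
    "vitamin c cures colds . "
    "vitamin c supports immunity ."
)

model = train_bigram_model(toy_corpus)
print(predict_next(model, "c"))  # prints "cures"
```

The predictor picks "cures" simply because that continuation is more common in the toy data. Real language models are vastly more sophisticated, but the sketch illustrates why frequency in training data is not the same as evidence.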

Training materials for these models include legitimate medical literature alongside Reddit discussions, wellness blogs, and social media debates. This heterogeneous data foundation, combined with the predictive nature of language generation, creates inherent vulnerabilities when addressing complex medical questions requiring nuanced understanding and evidence evaluation.

Research Context and Broader Implications

The BMJ Open findings align with growing evidence about the limitations of AI in medical contexts. A February 2026 Nature Medicine investigation revealed that while chatbots could theoretically provide correct medical answers approximately ninety-five percent of the time, real users obtained accurate information less than thirty-five percent of the time when interacting with these systems.

Additional research published in JAMA Network Open tested twenty-one leading AI models on diagnostic tasks. When provided only basic patient information such as age, sex, and symptoms, the models failed to suggest correct possible conditions more than eighty percent of the time. Accuracy improved dramatically with additional clinical data, highlighting the importance of comprehensive information for reliable performance.

A study in Communications Medicine, a Nature Portfolio journal, further demonstrated that chatbots readily repeated and elaborated upon fabricated medical terms inserted into prompts, revealing susceptibility to manipulation and misinformation propagation.

Practical Recommendations for Users

Despite these limitations, artificial intelligence chatbots offer valuable capabilities for summarizing complex topics, preparing questions for medical professionals, and initiating health research. However, the study makes clear that these systems should never serve as standalone medical authorities or replace professional healthcare advice.

Users seeking health information from chatbots should verify all medical claims through reputable sources, treat provided references as suggestions requiring independent confirmation rather than established facts, and remain alert to responses that sound confident but lack appropriate disclaimers about limitations. The research emphasizes that while artificial intelligence technology continues advancing, current chatbot implementations require cautious, critical engagement for health-related applications.