Key Insight: Microsoft, Amazon, and OpenAI have all launched consumer health chatbots, but independent researchers warn these tools need rigorous testing before public release.
The Health AI Boom
Earlier this month, Microsoft launched Copilot Health, a new space within its Copilot app where users can connect their medical records and ask specific questions about their health. Amazon announced that Health AI, previously restricted to One Medical members, would become widely available. These join OpenAI's ChatGPT Health, launched in January, and Anthropic's Claude, which can access health records with permission.
Why Now?
Dominic King, VP of health at Microsoft AI, cites AI advancement as a core reason. But demand is equally important: Microsoft receives 50 million health questions daily, making health the most popular topic on Copilot.
Girish Nadkarni, chief AI officer at Mount Sinai Health System, notes: "Access to health care is hard, and it's particularly hard for certain populations."
The Critical Gap: Independent Testing
A recent study from Mount Sinai found that ChatGPT Health sometimes recommends too much care for mild conditions and sometimes fails to identify emergencies. Most of the academic experts interviewed agree these tools could have real upsides, but all six expressed concern that they are being launched without testing by independent researchers.
The Evaluation Problem
Companies say they are testing their chatbots to ensure safe responses. OpenAI released HealthBench, a benchmark that scores LLMs on health conversations, but the conversations themselves are LLM-generated.
Andrew Bean, a doctoral candidate at the Oxford Internet Institute, found that even when an LLM can accurately identify a medical condition from a written scenario, non-expert users given the same scenario with LLM assistance might identify it only a third of the time.
What Works
Google released a study meeting Bean's standards. In that study, patients discussed medical concerns with Google's Articulate Medical Intelligence Explorer (AMIE) before meeting with a human physician. Overall, AMIE's diagnoses were just as accurate as physicians', and none of the conversations raised major safety concerns.
Despite the encouraging results, Google isn't planning to release AMIE anytime soon. "While the research has advanced, there are significant limitations that must be addressed before real-world translation," wrote Alan Karthikesalingam from Google DeepMind.
The Path Forward
Adam Rodman, who led the AMIE study, doesn't think extensive multiyear studies are necessarily the right approach: "There's lots of reasons that the clinical trial paradigm doesn't always work in generative AI. That's where this benchmarking conversation comes in. Are there benchmarks from a trusted third party that we can agree are meaningful?"