GPT-5 outperforms human experts in medicine

A study from Emory University shows that GPT-5, the latest model from OpenAI, has achieved results surpassing human experts in multimodal medical reasoning. It is not just a matter of performance—the challenge now is to understand how to integrate this tool safely and effectively into clinical practice and the pharmaceutical industry.

On August 13, 2025, researchers at Emory University published a study on arXiv that made a strong impression on the scientific community. The preprint, "Capabilities of GPT-5 on Multimodal Medical Reasoning," documents how GPT-5, without any specific training in the medical field, was able to tackle complex exams with a level of accuracy never seen before.

The breakthrough lies not only in the final score but in the model’s ability to integrate heterogeneous information—reports, clinical data, images—and reconstruct a logical thread leading to coherent diagnoses and decisions. Not an algorithm that merely “remembers,” but an assistant that reasons.

Methodology – How it was tested

To assess GPT-5’s capabilities, researchers adopted a standardized protocol designed to minimize any margin of manipulation. The model was evaluated in zero-shot chain-of-thought mode: no training on similar cases, only the prompt to “think step by step.”
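For context, here is a minimal sketch of what such a zero-shot chain-of-thought query might look like using the OpenAI Python client. The model identifier, prompt wording, and sample question are illustrative assumptions, not the study's actual materials:

```python
# Hypothetical zero-shot chain-of-thought query for a medical MCQ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative question in the style of MedQA/USMLE items (not from the study).
question = (
    "A 45-year-old man has severe chest pain after forceful vomiting, with "
    "subcutaneous emphysema in the neck. Most appropriate initial study?\n"
    "A) Barium swallow\n"
    "B) Gastrografin swallow\n"
    "C) Upper endoscopy\n"
    "D) Coronary angiography"
)

response = client.chat.completions.create(
    model="gpt-5",  # assumed identifier; substitute any model you have access to
    messages=[
        # Zero-shot: no worked examples are provided, only the instruction
        # to reason step by step before committing to an answer.
        {
            "role": "user",
            "content": question
            + "\n\nLet's think step by step, then give the final answer "
            "as a single letter.",
        },
    ],
)

print(response.choices[0].message.content)
```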

Well-established datasets in the field were used:

  • MedQA, containing U.S. medical licensing exam questions
  • MMLU-Medical, which assesses specialized medical knowledge
  • USMLE self-assessment, practice tests covering the three Steps of the U.S. medical licensing exam
  • MedXpertQA, with more than 4,400 questions across 17 specialties, also in multimodal form
  • VQA-RAD, focused on radiological images

This approach allows the model to be evaluated under conditions comparable to those faced by human test-takers, and benchmarked directly against earlier versions such as GPT-4o.
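Scoring on these benchmarks typically reduces to comparing the option letter the model commits to against the gold label. Below is a minimal sketch of such an accuracy computation, with an assumed record format and a simple letter-extraction heuristic; this is not the study's actual evaluation harness:

```python
import re


def extract_choice(model_output: str) -> str | None:
    """Return the last standalone option letter (A-E) in the model's output."""
    matches = re.findall(r"\b([A-E])\b", model_output)
    return matches[-1] if matches else None


def accuracy(records: list[dict], outputs: list[str]) -> float:
    """Fraction of items where the extracted letter matches the gold label."""
    correct = sum(
        1
        for rec, out in zip(records, outputs)
        if extract_choice(out) == rec["gold"]
    )
    return correct / len(records)


# Toy run: two items, one answered correctly -> 50.0%
records = [{"gold": "B"}, {"gold": "D"}]
outputs = ["...step by step... the answer is B.", "Therefore, choose A."]
print(f"accuracy = {accuracy(records, outputs):.1%}")
```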

Results – Numbers that surprise

The data speak for themselves:

  • On MedQA, GPT-5 achieved 95.8% accuracy, 4.8 percentage points above GPT-4o.
  • On the text-based MedXpertQA, the leap over GPT-4o was striking: +26.3 points in reasoning and +25.3 points in comprehension.
  • On the USMLE self-assessments, the model averaged 95.2%, with Step 2, the most clinically oriented, up by 4.1 points.
  • On multimodal MedXpertQA, the most complex challenge, it outperformed human experts: +24.2 points in reasoning and +29.4 points in comprehension (a sketch of such an image-plus-text query follows below).
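For the multimodal items, the question travels together with the image in a single prompt. A hedged sketch of what such a query could look like via the OpenAI chat API, with a placeholder model name and image URL:

```python
# Hypothetical multimodal query: a radiology image plus a free-text question,
# in the spirit of VQA-RAD / multimodal MedXpertQA items.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5",  # assumed identifier
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Describe the key abnormality on this chest "
                        "radiograph and state the most likely diagnosis. "
                        "Think step by step."
                    ),
                },
                # Placeholder URL; images can also be sent inline as
                # base64-encoded data URLs.
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.org/chest_xray.png"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```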

In one of the most compelling clinical cases, GPT-5 correctly diagnosed an esophageal perforation (Boerhaave's syndrome), identifying the clinical signs and recommending a Gastrografin contrast study as the first diagnostic step. Not only did it get the diagnosis right, but it also clearly explained why the alternative options were inappropriate.

Beyond the numbers – What it really means

The point is not that a model “beats” a doctor on a test. It is that GPT-5 demonstrates multimodal reasoning skills that make it, at least in standardized simulations, more reliable than a human professional.

But researchers remain cautious: these evaluations take place in controlled settings, with clean data and well-structured questions. Clinical reality is made of uncertainties—patients who do not tell the whole story, imperfect images, emergencies where time is scarce. In those contexts, AI is not yet ready to replace human experience.

Ethical and regulatory challenges

If a model outperforms physicians on paper, who authorizes it to step into the clinic?

  • Clinical validation – As with a drug, prospective and comparative studies are needed.
  • Clear rules – The EMA and FDA are drafting guidelines for AI in medicine, but a "superhuman" model makes faster rulemaking unavoidable.
  • Transparency – GPT-5 may deliver results, but its internal reasoning remains opaque.
  • Accountability – If an algorithm misdiagnoses, who bears responsibility?

These questions concern not only developers but also pharmaceutical companies that choose to integrate such tools into their pipelines.

What changes for the pharmaceutical industry

The potential for the sector is enormous:

  • Research – GPT-5 can combine genomic, structural, and clinical data to identify promising new molecules more quickly.

  • Clinical trials – Automating the analysis of radiological images or extracting data from thousands of medical records can cut timelines from months to hours.

  • Pharmacovigilance – Cross-referencing medical reports, literature, and real-world data enables earlier detection of adverse events.

  • Digital companions – Multimodal models capable of supporting chronic patients by interpreting symptoms and biometric data in real time.

This is not “industrial science fiction”: these are concrete applications that many companies are already exploring with earlier versions of GPT. GPT-5, with its multimodal reasoning capacity, can bring them to full maturity.

Conclusion – An assistant that forces us to rethink the future

The Emory University study does not suggest that physicians are obsolete, but that—for the first time—an AI model outperforms them in standardized tests. It is a powerful signal, one that compels both industry and regulators to address the issue not in abstract terms, but with urgency.

GPT-5 is not the end of medicine, but a new actor on stage. It is up to us to decide whether it becomes an ally capable of reducing errors, accelerating research, and improving patients’ lives, or a tool left on the sidelines out of fear of its risks.

The real challenge is no longer technical. It is cultural, regulatory, and ethical.

Source: Wang S, Hu M, Li Q, Safari M, Yang X. Capabilities of GPT-5 on multimodal medical reasoning. arXiv [Preprint]. 2025 [cited 2025 Aug 16]. Available from: https://arxiv.org/abs/2508.08224