AI in Medicine: From Predicting Research to Guiding Patient Care
- Maleeha Ilyas
This blog post aims to simplify the following scholarly research articles:
- Large Language Models Surpass Human Experts in Predicting Neuroscience Results, published in Nature Human Behaviour
- An Active Inference Strategy for Prompting Reliable Responses From Large Language Models in Medical Practice, published in Nature Human Behaviour
Introduction: BrainGPT Sets a New Standard
Imagine an AI that could predict how neuroscience experiments will turn out better than expert researchers who have dedicated years to their field. This is not science fiction but BrainGPT, a specialized large language model that is pushing into territory we once thought impossible for medical AI.
Recent research published in Nature Human Behaviour demonstrates that LLMs can surpass human experts at predicting the outcomes of neuroscience studies, with an average accuracy of 81.4% compared with 63.4% for human neuroscience experts. This raises a question: if AI can predict scientific outcomes accurately, how do we safely implement it in real medical applications?
BrainGPT Decoded: Predicting the Unpredictable
How Did They Do It?
The researchers behind this work built a benchmark they call "BrainBench", which tests the ability to predict actual neuroscience results. Whereas earlier tests simply had models recall established facts, BrainBench presents two versions of a research abstract, one real and one with its results altered, and asks which one contains the real findings.
Think of it this way: instead of asking the backward-looking question "What does this study say?", they asked the forward-looking question "What should this study find?" This matters because it tests understanding rather than memorization. What makes this even more remarkable is that when the AI models indicated high confidence in their predictions, they were significantly more likely to be correct, just like human experts. The models aren't just guessing; they provide predictions whose confidence levels track their accuracy.
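To make the forced-choice setup concrete, here is a minimal sketch of how such an evaluation can be scored with an off-the-shelf causal language model. GPT-2 is used purely for illustration (the study evaluated far more capable models, and its exact scoring details may differ): the idea is simply that the model assigns a perplexity to each abstract version and "chooses" the one it finds less surprising.

```python
# Minimal sketch of a BrainBench-style forced choice: score two candidate
# abstracts by perplexity and pick the less "surprising" one.
# GPT-2 is used only for illustration; the study evaluated larger LLMs.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Average cross-entropy of the text under the model, exponentiated."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

def choose_real_abstract(version_a: str, version_b: str) -> str:
    """Return the version the model judges more plausible (lower perplexity)."""
    return version_a if perplexity(version_a) < perplexity(version_b) else version_b

# Hypothetical toy example (not an actual benchmark item):
original = "Hippocampal lesions impaired spatial memory in the water maze task."
altered  = "Hippocampal lesions improved spatial memory in the water maze task."
print(choose_real_abstract(original, altered))
```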
The Results
Every LLM tested outperformed human experts in every neuroscience field: behavioral/cognitive, cellular/molecular, systems/circuits, neurobiology of disease, and development/plasticity/repair. More interestingly, smaller models with only 7 billion parameters performed about as well as the larger ones.
When LLMs were most confident, they were most likely to be correct, just as human experts are. This calibration between confidence and accuracy is important for building precise and trustworthy AI systems.
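To illustrate what calibration means here, the sketch below bins predictions by a confidence score (for example, the size of the perplexity gap between the two abstract versions) and checks whether accuracy rises with confidence. The arrays are hypothetical stand-ins, not data from the study.

```python
# Sketch of a calibration check: does accuracy increase with confidence?
# The confidence/correctness arrays below are hypothetical stand-ins.
import numpy as np

def accuracy_by_confidence(confidence, correct, n_bins=5):
    """Split items into confidence quantiles and report accuracy per bin."""
    confidence, correct = np.asarray(confidence), np.asarray(correct)
    edges = np.quantile(confidence, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(confidence, edges[1:-1]), 0, n_bins - 1)
    return [correct[bins == b].mean() for b in range(n_bins)]

rng = np.random.default_rng(0)
conf = rng.random(1000)                            # e.g., perplexity gap per item
correct = rng.random(1000) < (0.55 + 0.35 * conf)  # accuracy grows with confidence
print(accuracy_by_confidence(conf, correct))       # accuracy should rise across bins
```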
How Does It Work?
The study reveals that LLMs reached this level of performance by integrating information from entire abstracts, including background context and methodology, not only the result passages. Performance declined when the models were confined to local context alone.
The team also addressed another concern: were the models simply memorizing their training data? Several analyses indicated that the BrainBench items had not been memorized during training. Instead, the LLMs appear to have learned general patterns across neuroscience studies.
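One generic way to probe for memorization, borrowed from the training-data-extraction literature and not necessarily identical to the paper's own analyses, is to compare how easily the model predicts a text with how compressible that text is: memorized passages tend to have suspiciously low perplexity relative to their compressed size.

```python
# Sketch of one memorization probe: memorized texts tend to have unusually low
# model perplexity relative to their zlib-compressed size.
# This is a generic heuristic; it may not match the paper's exact analyses.
import zlib
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def memorization_score(text: str) -> float:
    """Approximate total negative log-likelihood divided by zlib-compressed size.
    Lower scores are more suspicious (the model predicts the text too easily)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        nll = model(ids, labels=ids).loss.item() * ids.shape[1]
    compressed_bytes = len(zlib.compress(text.encode("utf-8")))
    return nll / compressed_bytes

print(memorization_score("The hippocampus supports spatial memory."))
```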
Going a Step Further
To improve performance further, the researchers fine-tuned a pre-trained model specifically on neuroscience publications from 2002 to 2022, creating "BrainGPT." This specialized training yielded a 3% improvement in accuracy over the base model, meaning BrainGPT correctly predicted 3% more neuroscience outcomes than the general-purpose model it was built from. While 3% might sound small, in scientific prediction tasks it represents a meaningful gain in reliability.
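For readers curious what domain-specific fine-tuning can look like in practice, below is a minimal sketch using parameter-efficient LoRA adapters via the Hugging Face peft library. The base model name, target modules, and hyperparameters are illustrative assumptions, not the exact recipe behind BrainGPT.

```python
# Sketch of domain-specific fine-tuning with LoRA adapters (peft library).
# Model name, target modules, and hyperparameters are illustrative assumptions,
# not the exact recipe used to build BrainGPT.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # assumption: any capable open base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of weights is trained

# From here, a standard causal-LM training loop (e.g., transformers.Trainer)
# over a corpus of neuroscience abstracts would produce the adapted model.
```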

From Prediction to Practice: Making AI Safe for Patient Care
BrainGPT's predictions are impressive, but how can we safely deploy AI in medical contexts where real patients need accurate and appropriate responses?
The Challenge of Medical AI
Using these models in healthcare raises serious problems. Regulation is challenging because the technology has an effectively unlimited range of possible inputs and outputs, and there is no proven way to control dangerous responses. When LLMs process complex questions or rely on unverified information, they can generate confabulations (factually incorrect responses presented confidently as accurate) as well as hallucinations (responses that are inappropriate to the context). These errors pose serious risks in medical settings where accuracy is crucial.
The Solution: A Two-Agent Framework
To overcome these issues, researchers created an actor-critic framework. In this system, a Therapist AI agent (actor) provides answers to patient questions, and a Supervisor AI agent (critic) carefully assesses these answers for appropriateness, relevance, and accuracy.
This approach was tested using Cognitive Behavioral Therapy for Insomnia (CBT-I) as the chosen medical domain, and it is inspired by how the human brain actually works. In neuroscience, "active inference theory" describes how our brains don't passively receive information; instead, they actively predict what they expect to experience, then check those predictions against reality and correct any errors. The two-agent AI system mirrors this process: the Therapist agent generates predictions (responses), while the Supervisor agent acts as the "reality check," evaluating and correcting any errors. This dual system echoes how human reasoning naturally works, making the AI's responses more reliable and appropriate.
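Here is a minimal sketch of what one actor-critic turn might look like in code, assuming an OpenAI-style chat API. The prompts, model name, and single correction pass are simplifications for illustration, not the authors' actual implementation.

```python
# Sketch of one actor-critic turn: a Therapist agent drafts a reply and a
# Supervisor agent either approves it or substitutes a corrected reply.
# Assumes an OpenAI-style chat API; prompts and model name are illustrative.
from openai import OpenAI

client = OpenAI()          # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o-mini"      # placeholder model name

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def answer_patient(query: str) -> str:
    draft = ask("You are a CBT-I therapist. Answer the patient's question.", query)
    verdict = ask(
        "You are a clinical supervisor. Reply APPROVE if the draft is appropriate, "
        "relevant, and accurate for the patient's question; otherwise reply REVISE.",
        f"Patient question: {query}\nTherapist draft: {draft}",
    )
    if verdict.strip().upper().startswith("APPROVE"):
        return draft
    # The Supervisor produces a corrected response when the draft misses the point.
    return ask("You are a clinical supervisor. Write a corrected, appropriate reply.",
               f"Patient question: {query}\nRejected draft: {draft}")

print(answer_patient("I often have something on my mind that I worry about. "
                     "How can I stop worrying?"))
```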
RAG Architecture: The Technical Foundation
Instead of drawing on the LLM's entire training corpus, the system uses Retrieval-Augmented Generation (RAG), which limits the AI's replies to a validated, domain-specific knowledge base. The result is responses that are more precise, contextually relevant, and verifiable.
Here is how it works in practice (a minimal code sketch follows these steps):
Preparing Documents: Domain-specific documents (specialized for one particular field of medicine, in this case sleep therapy) are converted into standard formats and then divided into digestible sections, with their structure and markers preserved.
Vector Conversion: Each segment is converted into a vector representation using embeddings that capture semantic meaning, then stored in a database for quick retrieval.
Smart Retrieval: A patient's query is transformed into a vector and compared against the database. The most relevant passages are retrieved and used to generate contextually appropriate replies.
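The sketch below walks through these three steps end to end: a toy knowledge base is chunked and embedded, the patient's query is embedded the same way, and the most similar passages are stitched into a grounded prompt. The embedding model and the tiny document set are illustrative assumptions.

```python
# Minimal RAG sketch following the steps above: chunk documents, embed them,
# retrieve the most relevant chunks for a query, and build a grounded prompt.
# The embedding model and toy "knowledge base" are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Stimulus control: go to bed only when sleepy; leave the bed if awake for ~20 minutes.",
    "Sleep restriction: limit time in bed to match actual sleep time, then extend gradually.",
    "Cognitive restructuring: identify and challenge unhelpful beliefs about sleep.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)  # step 2: vectors

def retrieve(query: str, k: int = 2) -> list[str]:
    """Step 3: embed the query and return the k most similar chunks (cosine similarity)."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "I lie awake worrying at night. What should I do?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this verified material:\n{context}\n\nPatient: {query}"
print(prompt)  # this grounded prompt would then be sent to the Therapist agent
```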
Testing It in Practice: The CBT-I Study
To evaluate this strategy, the researchers created an AI agent, the Virtual Sleep Coach (VSC), for Cognitive Behavioral Therapy for Insomnia (CBT-I). This structured, protocol-driven treatment is well suited to testing AI reliability.
Design of the Study
The study included one hundred patient queries written by clinical psychologists. Three responses were prepared for each question: one from the VSC (approved by the Supervisor agent), one suitable response from a skilled CBT-I therapist, and one deliberately incorrect therapist response.
Then, without knowing which responses originated from AI or humans, two independent expert CBT-I therapists scored each response on a 1–5 scale (1 = potentially harmful, 2 = inappropriate or irrelevant, 3 = adequate but inexpert, 4 = adequate to the therapeutic context, 5 = consistent with expert therapy).
The Results: AI Matches (and Sometimes Exceeds) Human Experts
The findings were as follows:
The average rating for appropriate therapist replies was 4.071 ± 0.828.
VSC replies received higher average ratings: 4.327 ± 0.883 (p = 7.1 × 10⁻¹).
The deliberately unsuitable responses correctly received low ratings: 1.847 ± 0.923.
Because VSC responses were often longer than human responses (419.58 vs. 243.51 characters), their higher ratings may partly reflect greater detail. Even accounting for this length advantage, the main finding stands: the AI regularly produced suitable, expert-level responses.
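As a side note on how such blinded ratings are typically summarized and compared, the sketch below computes mean ± standard deviation for two groups and runs Welch's t-test. The rating arrays are hypothetical placeholders, and the paper's actual statistical procedure may differ.

```python
# Sketch of how two sets of blinded ratings can be summarized and compared.
# The rating arrays are hypothetical placeholders, not the study's data.
import numpy as np
from scipy import stats

therapist_ratings = np.array([4, 5, 4, 3, 4, 5, 4, 4, 3, 5])  # hypothetical
vsc_ratings       = np.array([5, 4, 5, 4, 4, 5, 4, 5, 4, 4])  # hypothetical

for name, r in [("Therapist", therapist_ratings), ("VSC", vsc_ratings)]:
    print(f"{name}: {r.mean():.3f} ± {r.std(ddof=1):.3f}")  # mean ± SD

# Welch's t-test (does not assume equal variances)
t, p = stats.ttest_ind(vsc_ratings, therapist_ratings, equal_var=False)
print(f"t = {t:.3f}, p = {p:.3g}")
```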
Let's look at the two-agent system through an actual example from the study:
Patient Query: “I often have something on my mind that I worry about. How can I stop worrying?”
Therapist Agent Response: “To stop worrying, try not to fall asleep while worrying as it can interfere with sleep. Turn on the light, leave the bedroom, and engage in a different activity like reading a book to distract yourself. Avoid stimulating activities before bed. Put the clock under the bed or out of sight to prevent clock watching, which can lead to frustration and interfere with sleep. Avoid taking naps during the day to help you fall asleep at night. Practicing relaxation techniques may also help alleviate worrying and promote better sleep.”
Supervisor Agent Feedback: “Therapist’s response seems to be wrong. The response provided information on how to improve sleep hygiene and address insomnia, rather than directly addressing the issue of how to stop worrying. While improving sleep can indirectly help reduce worrying, it does not directly address techniques for managing or stopping worrying thoughts.”
Supervisor Agent Response: “To stop worrying during the day, you can try cognitive restructuring techniques like identifying and challenging negative thoughts, engaging in problem-solving to address your worries, and setting aside specific ‘worry time’ during the day. Additionally, practicing mindfulness and grounding techniques can help you stay present and reduce anxious thoughts. It’s important to address worrying during daytime hours to prevent it from interfering with your sleep at night.”
This improvement directly addresses the patient's concern about managing worry, rather than just focusing on its effect on sleep. By recognizing the patient's underlying goals, rather than merely responding to the literal question, the Supervisor agent showed sensitivity to the patient's true intentions. This highlights how the LLM's training enables it to grasp subtle nuances of human language.
How Are LLM Responses Different? The researchers also compared responses from another advanced model (ChatGPT o1) with human responses and found that 49 of the model's 50 responses were correct. The model stood out in the following ways:
LLM responses: more comprehensive and formal, with detailed explanations, careful multi-step reasoning, and health warnings.
Human responses: more concise and practical, using everyday language, often informal remarks delivered in a more personal and encouraging tone.
Even when LLM responses were less engaging or personal, their more formal tone and structured approach could be beneficial, since it marks the response as coming from a computer rather than attempting to imitate human speech.
Conclusion: Potential Ahead
The journey from BrainGPT's predictive skills to useful medical applications demonstrates both the immense potential of medical AI and the careful approach it requires. Although BrainGPT shows that LLMs can identify underlying patterns in neuroscience and produce accurate predictions, successful practical use requires structured frameworks with domain-specific knowledge, validation procedures, and dual-agent supervision.
By limiting LLM answers to verified knowledge domains and adding oversight through the Supervisor agent, the actor-critic framework offers an achievable solution, ensuring that answers follow clinical practice and rely on trusted medical knowledge.
The results of both studies show that AI can become a useful tool for accessing medical knowledge. It will not replace human expertise, but augment it in ways that could improve patient care, particularly for underrepresented groups. As the research team pointed out, prediction is important, but human experts continue to play a crucial role in offering relevant scientific explanations.
