Optimized Context Retrieval Enables Cost-Effective, High-Performance Healthcare AI with Open-Source LLMs

Community Article Published April 14, 2025

Large Language Models (LLMs) hold immense potential for healthcare applications, from aiding in medical question-answering to improving clinical decision support. However, the high operational costs and accessibility limitations of leading proprietary models often pose significant barriers, particularly in resource-constrained settings. Achieving the high levels of performance and reliability demanded by healthcare requires innovative approaches that balance capability with cost-effectiveness.

Our recent paper, "Pareto-Optimized Open-Source LLMs for Healthcare via Context Retrieval", addresses this challenge directly. We demonstrate that by strategically augmenting open-source LLMs with optimized Context Retrieval (CR) techniques, it's possible to achieve state-of-the-art performance on demanding medical benchmarks at a fraction of the cost associated with closed models.

Key contributions:

  1. Practical Guidelines for Optimized Context Retrieval: We present a reproducible pipeline, empirically validated, for configuring cost-effective context retrieval systems tailored for healthcare AI.
  2. Empirical Validation of an Improved Pareto Frontier: We provide robust empirical evidence showing that our optimized approach significantly shifts the Pareto frontier for medical question answering, enabling open-source models to operate in a new regime of high performance and efficiency.
  3. OpenMedQA: Recognizing the limitations of multiple-choice formats, we introduce OpenMedQA, a new benchmark for evaluating open-ended medical question-answering, which reveals a consistent accuracy drop when models must generate free-text answers instead of selecting an option.
  4. Open-Source Resources for the Community: We release our Prompt Engine library, reasoning-augmented CoT/ToT/Thinking databases, and the OpenMedQA benchmark to facilitate further research and development in the community.

Optimizing Context Retrieval for Healthcare LLMs

Our methodology builds upon established Retrieval-Augmented Generation (RAG) principles, specifically inspired by the Medprompt architecture, but focuses on optimizing components for cost-effectiveness using open-source models. The core idea is to ground the LLM's responses in relevant, high-quality external knowledge retrieved efficiently.

Here's a breakdown of key components in our optimized pipeline (two minimal code sketches follow the list):

  • Choice Shuffling: A simple but effective technique to mitigate the positional bias often observed when LLMs handle multiple-choice questions (e.g., favoring "Option A"), applied with minimal overhead.
  • Embedding Model: Responsible for encoding queries and database entries for semantic search. We found that smaller, healthcare-specific models such as PubMedBERT rival larger generalist embedders (e.g., SFR-Mistral), offering competitive retrieval quality with lower resource demands.
  • Ensemble Refinement (Self-Consistency): Aggregates multiple sampled reasoning paths through majority voting for more robust predictions. We found that N=5 ensembles offered a strong balance between accuracy gains (~3.5% improvement over baseline) and computational cost (including CO₂ footprint).
  • High-Quality Examples as Context: The source of external knowledge. We found that enriching the database with high-quality reasoning examples (Chain-of-Thought, Tree-of-Thought, or reasoning distilled from capable models) significantly boosts performance. Our "Thinking" database, augmented with reasoning traces from DeepSeek-R1, yielded an average accuracy improvement of 3.61% across benchmarks.
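
To make the retrieval component concrete, here is a minimal sketch of semantic search over a reasoning-augmented example database, using a PubMedBERT-style sentence-embedding checkpoint and cosine similarity. The checkpoint name, the database schema, and the choice of k are illustrative assumptions rather than the exact configuration from the paper.

```python
# Minimal sketch of semantic retrieval over a reasoning-augmented example
# database. The checkpoint name and database schema are assumptions for
# illustration; swap in the embedding model and CoT/ToT/Thinking data you use.
from sentence_transformers import SentenceTransformer, util

# A PubMedBERT-based sentence-embedding checkpoint (illustrative choice).
embedder = SentenceTransformer("NeuML/pubmedbert-base-embeddings")

# Each entry pairs a question with a high-quality reasoning trace and answer.
example_db = [
    {"question": "Which vitamin deficiency causes scurvy?",
     "reasoning": "Scurvy results from impaired collagen synthesis ...",
     "answer": "Vitamin C"},
    # ... thousands of reasoning-augmented entries ...
]

db_embeddings = embedder.encode(
    [ex["question"] for ex in example_db],
    convert_to_tensor=True,
    normalize_embeddings=True,
)

def retrieve_examples(query: str, k: int = 5):
    """Return the k database entries most similar to the query."""
    query_emb = embedder.encode(query, convert_to_tensor=True,
                                normalize_embeddings=True)
    hits = util.semantic_search(query_emb, db_embeddings, top_k=k)[0]
    return [example_db[hit["corpus_id"]] for hit in hits]
```

The retrieved question/reasoning/answer triples are then formatted as few-shot demonstrations in the prompt, which is what grounds the model's response in verified examples.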
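
Choice shuffling and self-consistency voting are similarly lightweight. In the sketch below, the options are re-shuffled for each of N=5 sampled generations and the final prediction is a majority vote over the de-shuffled answers. The generate callable is a hypothetical stand-in for your LLM inference call, not an API of our Prompt Engine.

```python
# Sketch of choice shuffling + self-consistency majority voting.
# `generate` is a hypothetical stand-in for your LLM inference call; it is
# assumed to return the letter the model picked for the shuffled prompt.
import random
from collections import Counter

def answer_with_ensemble(question: str, options: list[str],
                         generate, n_samples: int = 5) -> str:
    """Majority-vote answer over n_samples runs with shuffled option order."""
    votes = Counter()
    for _ in range(n_samples):
        shuffled = random.sample(options, k=len(options))   # fresh order per run
        prompt = question + "\n" + "\n".join(
            f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(shuffled)
        )
        letter = generate(prompt).strip().upper()[:1]        # e.g. "B"
        idx = ord(letter) - ord("A") if letter else -1
        if 0 <= idx < len(shuffled):
            votes[shuffled[idx]] += 1          # vote for the option text itself
    return votes.most_common(1)[0][0] if votes else options[0]
```

Voting over the option text rather than the letter is what makes shuffling safe: the same underlying answer accumulates votes no matter where it appeared in each sampled prompt.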

Redefining the Cost-Accuracy Pareto Frontier with Open-Source Models

The Pareto frontier represents the optimal balance between accuracy and computational cost. In healthcare, pushing this frontier towards higher accuracy and lower cost is crucial for practical deployment. Historically, the upper end of this frontier on benchmarks like MedQA has been occupied by large, costly proprietary models such as GPT-4 and Med-PaLM. Our work rewrites this story, proving that open-source LLMs, powered by optimized context retrieval, can not only compete but also push the frontier further.

Using our refined pipeline, we benchmarked open-source models and achieved competitive results. DeepSeek-R1, enhanced with our approach, reached over 94% accuracy on MedQA, surpassing previous state-of-the-art marks set by proprietary systems. Meanwhile, Aloe-Beta-70B, with just 70 billion parameters, hit 89% accuracy, closing the gap with larger models while maintaining a leaner computational footprint. These results establish a new efficiency standard, proving that top-tier accuracy is achievable without relying on the most expensive proprietary solutions.

Putting it in Context: Performance Across the Board

The effectiveness of our optimized context retrieval approach is consistent across various models and benchmarks. The table below shows the performance uplift when applying context retrieval to several state-of-the-art open-source LLMs.

| Model | CareQA | MedMCQA | MedQA | MMLU | Average |
| --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B | 69.95 | 59.22 | 63.71 | 75.72 | 67.15 |
| with CR | +6.07 | +12.79 | +17.36 | +9.33 | +11.39 |
| Qwen2.5-7B | 72.14 | 56.18 | 61.59 | 77.92 | 66.96 |
| with CR | +3.08 | +13.00 | +12.64 | +6.13 | +8.71 |
| Aloe-Beta-8B | 70.77 | 59.57 | 64.65 | 76.50 | 67.87 |
| with CR | +5.37 | +12.72 | +16.26 | +7.60 | +10.49 |
| Llama-3.1-70B | 83.72 | 72.15 | 79.73 | 87.45 | 80.76 |
| with CR | +3.15 | +5.69 | +9.66 | +3.84 | +5.54 |
| Qwen2.5-72B | 85.45 | 69.26 | 77.85 | 88.81 | 80.34 |
| with CR | +1.08 | +7.55 | +7.46 | +2.75 | +4.71 |
| Aloe-Beta-70B | 83.19 | 72.15 | 79.73 | 88.44 | 80.88 |
| with CR | +4.38 | +5.28 | +9.11 | +3.01 | +5.45 |
| DeepSeek-R1 | 88.33 | 73.34 | 82.48 | 91.27 | 83.86 |
| with CR | +4.18 | +8.94 | +11.94 | +3.61 | +7.17 |
| GPT-4 + Medprompt* | - | 79.10 | 90.20 | 94.20 | - |
| MedPalm-2 + ER* | - | 72.30 | 85.40 | 89.40 | - |
| O1 + TPE* | - | 83.90 | 96.00 | 95.28 | - |

*Results reported by others. ER: Ensemble Refinement (Google’s custom prompt technique). TPE: Tailored Prompt Ensemble (custom OpenAI ensemble technique).

The results show consistent and statistically significant accuracy improvements across all tested models and datasets. Notably, the magnitude of the gain often exhibits an inverse correlation with the base model's performance – smaller models tend to benefit more, with average accuracy gains frequently exceeding 10%. This highlights CR's effectiveness in compensating for the inherent knowledge limitations of smaller LLMs. Even highly capable models like DeepSeek-R1 see substantial boosts (over 7% average gain), pushing their performance closer to the theoretical ceiling.

OpenMedQA: Beyond Multiple-Choice

While multiple-choice question answering (MCQA) benchmarks are useful, real-world clinical interactions often require generating nuanced, free-text responses. To address this, we introduce OpenMedQA, a novel benchmark derived from MedQA and designed to evaluate LLMs on open-ended medical question answering. It was created by systematically rephrasing the questions from the MedQA test set into an open-ended format while retaining the original medical intent and grounding in verified knowledge (a minimal sketch of this conversion appears below).
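
To make the construction concrete, the sketch below shows the kind of transformation involved: the answer options are dropped, the question is rewritten in open-ended form (delegated here to a hypothetical rephrase_question helper, e.g. an LLM with a rewriting prompt), and the gold option text is kept as a reference answer. The field names and the helper are illustrative assumptions, not the exact OpenMedQA construction pipeline.

```python
# Illustrative sketch of turning a MedQA-style MCQA item into an open-ended
# item. `rephrase_question` is a hypothetical helper (e.g. an LLM rewriting
# prompt); the record field names are assumptions for illustration.
def to_open_ended(item: dict, rephrase_question) -> dict:
    """Rewrite an MCQA item as an open-ended item with a reference answer."""
    return {
        "question": rephrase_question(item["question"]),      # drop option cues
        "reference_answer": item["options"][item["answer_idx"]],
    }

# Hypothetical MedQA-style record.
mcqa_item = {
    "question": "A 45-year-old man presents with acute knee pain ... "
                "Which of the following is the most likely diagnosis?",
    "options": ["Gout", "Septic arthritis", "Pseudogout", "Reactive arthritis"],
    "answer_idx": 1,
}
# oe_item = to_open_ended(mcqa_item, rephrase_question=my_llm_rewriter)
```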

Our findings? Comparing model performance on MedQA (MCQA) versus OpenMedQA (OE-QA) reveals a consistent performance drop across all evaluated models when shifting to the open-ended format.

| Model | MedQA | OpenMedQA | Performance Drop |
| --- | --- | --- | --- |
| Llama-3.1-8B-Instruct | 63.71 | 33.88 | -29.82 |
| Qwen2.5-7B-Instruct | 61.59 | 38.76 | -22.83 |
| Llama3.1-Aloe-Beta-8B | 64.65 | 52.91 | -11.74 |
| Llama-3.1-70B-Instruct | 79.73 | 60.46 | -19.28 |
| Qwen2.5-72B-Chat | 77.85 | 61.24 | -16.61 |
| Llama3.1-Aloe-Beta-70B | 79.73 | 65.02 | -14.72 |
| DeepSeek-R1 | 82.48 | 75.86 | -6.62 |

The performance drop ranges from -6.62% for DeepSeek-R1 to nearly -30% for Llama-3.1-8B-Instruct. While models with stronger reasoning capabilities show more robustness, the gap highlights the increased difficulty of generating accurate and relevant free-text medical answers compared to selecting from predefined options. This underscores the need for targeted research and evaluation methods for OE-QA in healthcare. OpenMedQA is publicly released to support this effort.

Empowering the Community

True innovation happens when the community collaborates. We’re proud to make all our resources available to you:

  • Prompt Engine: Get started with our optimized retrieval pipeline.
  • OpenMedQA: Evaluate your models using a benchmark designed for the complexities of open-ended medical question answering.
  • CoT/ToT/Thinking databases: Access pre-built, reasoning-augmented datasets that can boost your model’s performance.

We invite you to explore these resources, share your experiments, and contribute to advancing accessible healthcare AI.

Conclusion & Future Directions

This research demonstrates that optimized context retrieval is a powerful technique for enhancing the performance and cost-effectiveness of open-source LLMs in the demanding healthcare domain. By carefully tuning the retrieval pipeline, we enabled open models to achieve state-of-the-art accuracy on medical MCQA benchmarks while significantly lowering the cost barrier compared to proprietary alternatives.

Our introduction of OpenMedQA highlights the remaining challenges in open-ended medical reasoning, where even top models exhibit performance drops. Future work should focus on further refining context retrieval strategies, potentially exploring adaptive ensemble methods specifically tailored for OE-QA, and integrating domain-specific retrieval mechanisms to bridge this gap.

By advancing these techniques, we can develop more reliable, affordable, and accessible AI solutions to support healthcare professionals and improve patient outcomes.

📄 Read the full paper
