Novel AI Framework for Detecting LLM "Hallucinations" in Medical Summaries

by Andrii Buvailo, PhD

Disclaimer: All opinions expressed by Contributors are their own and do not represent those of their employers or BiopharmaTrend.com.
Contributors are fully responsible for ensuring they own any required copyright for any content they submit to BiopharmaTrend.com. This website and its owners shall not be liable for the information and content submitted for publication by Contributors, nor for its accuracy.

  
Topics: HealthTech   

Mendel AI, a startup specializing in healthcare artificial intelligence (AI), in collaboration with the University of Massachusetts Amherst (UMass Amherst), has unveiled new research focused on addressing the challenge of "faithfulness hallucinations" in AI-generated medical summaries. This research is an important step toward ensuring the reliability and accuracy of AI applications in healthcare, particularly in clinical decision-making.

The research centers on large language models (LLMs) like GPT-4o and Llama-3, which have shown potential in generating medical summaries. However, these models are prone to hallucinations—instances where the AI generates incorrect or misleading information. These inaccuracies pose significant risks in medical contexts, potentially leading to misdiagnoses or inappropriate treatments.

The study categorizes hallucinations into five distinct types and introduces a detection framework designed to identify these errors systematically. A pilot study involving 100 medical summaries generated by GPT-4o and Llama-3 revealed that while GPT-4o produced longer summaries, it often made erroneous reasoning leaps, leading to more hallucinations. In contrast, Llama-3 generated fewer hallucinations by avoiding extensive inferences but at the cost of summary quality.

The detection framework flagged specific categories of error, including medical-event inconsistencies, incorrect reasoning, and chronological errors. GPT-4o showed a higher incidence of incorrect reasoning and medical-event inconsistencies, whereas Llama-3 made fewer such errors but produced lower-quality summaries.
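
To make that comparison concrete, the sketch below shows one way such error categories and flagged spans might be represented so they can be tallied per model. The class and function names (HallucinationType, Finding, tally_by_model) are illustrative assumptions; the article does not describe the study's actual annotation schema.

```python
# Illustrative sketch only, not the study's actual data model.
from dataclasses import dataclass
from enum import Enum, auto
from collections import Counter


class HallucinationType(Enum):
    MEDICAL_EVENT_INCONSISTENCY = auto()   # event contradicts the source record
    INCORRECT_REASONING = auto()           # unsupported inferential leap
    CHRONOLOGICAL_ERROR = auto()           # events ordered or dated incorrectly


@dataclass
class Finding:
    model: str                  # e.g. "GPT-4o" or "Llama-3"
    summary_id: int             # which generated summary the span came from
    span: str                   # offending text in the generated summary
    category: HallucinationType


def tally_by_model(findings: list[Finding]) -> dict[str, Counter]:
    """Count hallucination categories per model for a side-by-side comparison."""
    counts: dict[str, Counter] = {}
    for f in findings:
        counts.setdefault(f.model, Counter())[f.category] += 1
    return counts
```
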

The Hypercube System: A Tool for Automated Detection of LLM "Hallucinations"

To tackle hallucinations at scale, the research explored automated methods that could reduce the cost and time of manual review. Central to this effort is the Hypercube system, which integrates medical knowledge bases, symbolic reasoning, and natural language processing (NLP) to detect hallucinations. The system builds a comprehensive representation of each patient document, enabling an initial automated detection pass before human expert review.
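
The article describes Hypercube only at a high level, so the following is a minimal sketch of the general screening pattern it outlines, using invented names (Claim, screen_summary): an automated first pass checks each claim extracted from a summary against a structured representation of the patient record, and only unsupported claims are escalated to human experts.

```python
# Minimal sketch of the "automated screen before human review" pattern.
# All names are hypothetical; this is not Mendel AI's Hypercube API.
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    subject: str      # e.g. "metformin"
    predicate: str    # e.g. "prescribed_on"
    value: str        # e.g. "2023-04-02"


def screen_summary(claims: list[Claim], patient_facts: set[Claim]) -> list[Claim]:
    """Automated first pass: return claims not grounded in the patient record,
    which are the ones escalated to human expert review."""
    return [c for c in claims if c not in patient_facts]


if __name__ == "__main__":
    facts = {Claim("metformin", "prescribed_on", "2023-04-02")}
    summary_claims = [
        Claim("metformin", "prescribed_on", "2023-04-02"),  # supported by the record
        Claim("insulin", "prescribed_on", "2023-04-02"),    # unsupported, flag for review
    ]
    print(screen_summary(summary_claims, facts))  # only the insulin claim is flagged
```

In a production system such as the one described, the simple set lookup above would presumably be replaced by the knowledge-base and symbolic-reasoning components the article mentions, but the overall flow of automated pre-screening followed by expert review stays the same.
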

Dr. Wael Salloum, Chief Scientific Officer at Mendel AI, emphasized the importance of continually enhancing Hypercube’s capabilities. The system’s real-time data processing and adaptive learning algorithms are designed to keep it at the forefront of clinical innovation, ensuring reliable and accurate AI tools for healthcare.

The ongoing integration of AI into healthcare makes addressing hallucinations in LLM outputs increasingly crucial. Future research will focus on refining detection frameworks and exploring more advanced automated systems like Hypercube. The goal is to achieve the highest levels of accuracy and reliability in AI-generated medical content.

The academic community has recognized the significance of this work. The research paper, titled "Faithfulness Hallucination Detection in Healthcare AI," has been accepted for presentation at the KDD AI conference in August 2024 and details the methodologies and technologies behind Hypercube.
