Referential hallucination and clinical reliability in large language models: a comparative analysis using regenerative medicine guidelines for chronic pain
- 01.10.2025
- Observational Research
- Authors
-
Ozlem Kuculmez
- Department of Physical Medicine and Rehabilitation, Baskent University Alanya Hospital, 07400, Antalya, Turkey
-
Ahmet Usen (corresponding author)
- Faculty of Medicine, Department of Physical Medicine and Rehabilitation, Medipol University, 34815, Istanbul, Turkey
-
Emine Dündar Ahi
- Department of Physical Medicine and Rehabilitation, Kocaeli Health and Technology University, Kocaeli, Turkey
- Published in
- Rheumatology International | Issue 10/2025
Abstract
This study compared large language models' responses to open-ended questions on regenerative therapy guidelines for chronic pain, assessing their accuracy, reliability, usefulness, readability, semantic similarity, and hallucination rates.

This cross-sectional study used 16 open-ended questions based on the American Society of Pain and Neuroscience's regenerative therapy guidelines for chronic pain. Questions were answered by ChatGPT-4o, Gemini 2.5 Flash, and Claude Opus 4. Responses were rated on a 7-point Likert scale for usability and reliability, and on a 5-point scale for accuracy. Reference hallucination, readability (FKRE, FKGL), and semantic similarity (USE, ROUGE-L) were also assessed. Statistical comparisons were made, with significance set at p < 0.05.

Claude Opus 4 showed the highest reliability (5.19 ± 1.11), usefulness (5.06 ± 1.0), and clinical accuracy (4.06 ± 0.68), outperforming ChatGPT-4o (4.13 ± 0.96; 3.94 ± 0.85; 3.38 ± 0.72) and Gemini 2.5 Flash (4.19 ± 0.98; 4.06 ± 0.93; 3.38 ± 0.62). Claude Opus 4 also had the lowest reference hallucination score (RHS 4.44 ± 3.18) vs. ChatGPT-4o (8.38 ± 1.86) and Gemini 2.5 Flash (8.75 ± 1.73). In semantic similarity, Claude Opus 4 (0.68 ± 0.08) and Gemini 2.5 Flash (0.65 ± 0.07) surpassed ChatGPT-4o (0.60 ± 0.09). Gemini 2.5 Flash led in ROUGE-L F1 (0.12 ± 0.03) vs. Claude Opus 4 (0.10 ± 0.02) and ChatGPT-4o (0.07 ± 0.03). Readability was similar across models, though Gemini 2.5 Flash had a higher FKGL (11.3 ± 1.06) than Claude Opus 4 (10.3 ± 2.09).

Claude Opus 4 showed superior accuracy, reliability, and usefulness, with significantly fewer reference hallucinations. Readability scores were similar across models. Further research is recommended.
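Two of the automated metrics named in the abstract, FKGL and ROUGE-L F1, follow standard published formulas. The sketch below illustrates those formulas only; it is not the authors' actual scoring pipeline, and the paper's tokenization and syllable counting are not specified here.

```python
def rouge_l_f1(reference: list[str], candidate: list[str]) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    m, n = len(reference), len(candidate)
    # Longest common subsequence via dynamic programming.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if reference[i] == candidate[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision = lcs / n   # fraction of candidate tokens in the LCS
    recall = lcs / m      # fraction of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level (standard coefficients)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

For example, a 100-word, 5-sentence passage with 150 syllables scores an FKGL of about 9.9, i.e. roughly a US 10th-grade reading level, comparable to the 10-11 range reported for the models above.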
- Publisher
- Springer Berlin Heidelberg
- Print ISSN: 0172-8172
- Electronic ISSN: 1437-160X
- DOI
- https://doi.org/10.1007/s00296-025-05996-z