Published in: Journal of Medical Systems 1/2023

01.12.2023 | Original Paper

Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis

Authors: Haifeng Song, Yi Xia, Zhichao Luo, Hui Liu, Yan Song, Xue Zeng, Tianjie Li, Guangxin Zhong, Jianxing Li, Ming Chen, Guangyuan Zhang, Bo Xiao


Abstract

Objectives

To evaluate the effectiveness of four large language models (LLMs) with large user bases and significant public attention (Claude, Bard, ChatGPT4, and New Bing) in the context of medical consultation and patient education for urolithiasis.

Materials and methods

In this study, we developed a questionnaire consisting of 21 questions and 2 clinical scenarios related to urolithiasis. Subsequently, clinical consultations were simulated for each of the four models to assess their responses to the questions. Urolithiasis experts then evaluated the model responses in terms of accuracy, comprehensiveness, ease of understanding, human care, and clinical case analysis ability based on a predesigned 5-point Likert scale. Visualization and statistical analyses were then employed to compare the four models and evaluate their performance.
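The scoring scheme described above (multiple experts rating each model's responses on a 5-point Likert scale across several dimensions) can be sketched in code. This is a minimal illustration of how such ratings might be aggregated; the model names match the study, but all scores and the dimension subset shown are hypothetical placeholders, not the study's data.

```python
# Hypothetical sketch: aggregate expert Likert ratings (1-5) per model
# and per evaluation dimension, as in the Methods. Scores are invented.
from statistics import mean

# ratings[model][dimension] = list of individual expert scores
ratings = {
    "Claude":   {"accuracy": [5, 4, 5], "comprehensiveness": [4, 5, 4]},
    "ChatGPT4": {"accuracy": [4, 4, 5], "comprehensiveness": [4, 4, 4]},
}

def mean_scores(ratings):
    """Return the mean Likert score for each model and dimension."""
    return {
        model: {dim: round(mean(scores), 2) for dim, scores in dims.items()}
        for model, dims in ratings.items()
    }

print(mean_scores(ratings))
```

Per-dimension means like these could then feed the visualization and statistical comparison the authors describe.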

Results

All models performed satisfactorily, except Bard, which failed to provide a valid response to Question 13. Claude consistently scored highest across all dimensions. ChatGPT4 ranked second in accuracy and produced relatively stable output across multiple tests, but showed shortcomings in empathy and human care. Bard exhibited the lowest accuracy and overall performance. Both Claude and ChatGPT4 demonstrated a strong ability to analyze clinical cases of urolithiasis. Overall, Claude emerged as the best performer in urolithiasis consultation and education.

Conclusion

Claude demonstrated superior performance compared with the other three models in urolithiasis consultation and education. This study highlights the remarkable potential of LLMs in medical health consultations and patient education, although professional review, further evaluation, and modifications are still required.
Metadata
Title
Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis
Authors
Haifeng Song
Yi Xia
Zhichao Luo
Hui Liu
Yan Song
Xue Zeng
Tianjie Li
Guangxin Zhong
Jianxing Li
Ming Chen
Guangyuan Zhang
Bo Xiao
Publication date
01.12.2023
Publisher
Springer US
Published in
Journal of Medical Systems / Issue 1/2023
Print ISSN: 0148-5598
Electronic ISSN: 1573-689X
DOI
https://doi.org/10.1007/s10916-023-02021-3