Automated cardiac magnetic resonance interpretation derived from prompted large language models
Introduction
Background
Cardiac magnetic resonance (CMR) has become an essential tool for diagnosing and managing cardiovascular diseases (1,2). It provides versatile information due to the multi-modality nature, serving as the gold standard for evaluating cardiac structure and function (3,4). However, CMR interpretations require a high level of expertise from radiologists. The classification and diagnosis of CMR reports rely on experience due to the complexity of CMR, which are often time-consuming and subject to variability (4).
With the rapid advancement of artificial intelligence (AI) technologies, particularly in the development of large language models (LLMs), AI-assisted diagnosis has gained transformative potential. LLMs possess robust text comprehension and generation capabilities, and are able to learn and extract complex linguistic patterns and clinical knowledge from large amounts of medical text data. Researchers have explored the application of LLMs across various clinical tasks, such as automatic determination of radiology protocols from request forms (5), extraction of key information from imaging reports (6), and automatic disease classification (7,8), achieving promising results.
Rationale and knowledge gap
Integrating LLMs into clinical workflows has the potential to improve diagnostic efficiency, reduce human error, and provide clinical decision support (9). Research has confirmed that GPT-4.0 can accurately understand and extract valid information from CMR reports (6). Consequently, LLMs hold promise for the automated and precise interpretation and classification of CMR reports (10). This advancement could not only offer new insights into the application of AI in medical imaging interpretation but also drive the development of intelligent diagnostic tools and enhance the diagnosis and management of cardiovascular diseases. However, further exploration is needed to fully understand the capability of general LLMs to handle the complex medical terminology in CMR reports, to generate automated interpretations, and to transfer to real-world clinical settings. Myocardial infarction (MI), hypertrophic cardiomyopathy (HCM), and dilated cardiomyopathy (DCM) are three prevalent cardiac disorders that pose significant clinical burdens and diagnostic challenges. MI remains the leading cause of cardiovascular mortality (11), with high incidence rates but insufficient use of reperfusion therapy in rural China. HCM, the most common inherited heart disease, is substantially underdiagnosed and carries an elevated risk of sudden cardiac death (12). DCM accounts for 15–20% of heart failure cases with a 5-year survival rate of only 50–60%, while its etiological heterogeneity (viral, genetic, etc.) substantially compromises conventional diagnostic efficiency (13).
Objective
Thus, this study aimed to evaluate the performance of six widely used general LLMs—GPT-3.5, GPT-4.0, Gemini-1.0, Gemini-1.5, PaLM, and LLaMA—under both minimal and informative prompting. The assessment focused on their ability to automatically classify and diagnose CMR reports, as well as their agreement with diagnoses made by radiologists. This study was intended to establish the baseline performance of LLMs in CMR interpretation and provide a foundation for intelligent assisted diagnosis in CMR. We present this article in accordance with the STARD reporting checklist (available at https://cdt.amegroups.com/article/view/10.21037/cdt-2025-112/rc).
Methods
Patient population
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of The Second Affiliated Hospital of Kunming Medical University (No. YJ2024112) and individual consent for this retrospective analysis was waived. This study included radiology reports of consecutive patients who underwent CMR examinations at The Second Affiliated Hospital of Kunming Medical University from January 2015 to July 2024. All CMR reports were initially written by junior radiologists with 3–5 years of experience and were finally reviewed by senior radiologists with 10–15 years of experience. The structure and content of the reports referred to the Society for Cardiovascular Magnetic Resonance Guidelines for Reporting CMR (14), mainly interpreting the morphology and size of the heart, its motion and function, as well as the status of late gadolinium enhancement. The inclusion criteria were as follows: (I) patients aged at least 18 years old; (II) CMR reports with a confirmed diagnosis of MI, DCM, or HCM; (III) complete CMR examination, including cine and late gadolinium enhancement sequences; (IV) CMR reports were detailed and complete, with examples of the CMR report templates provided in Appendix 1. Exclusion criteria included: (I) patients with incomplete CMR reports; (II) patients with mixed diagnoses in their CMR reports, such as MI combined with HCM (Figure 1).
Collection and processing of reports
The CMR reports were exported into an Excel spreadsheet (Microsoft). Three independent readers, who were blinded to the patients’ imaging information, reviewed all the CMR reports and translated them from Chinese to English. They removed all patient identification data and extracted each patient’s gender, the imaging descriptions of CMR, and the diagnostic interpretations. The content of the CMR reports was preserved, with modifications limited to the correction of typographical errors and omissions. Based on the diagnosis of radiologists, the CMR reports were categorized into three groups: (I) MI, (II) DCM, and (III) HCM.
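As a minimal sketch, the grouping step might look like the following; the field names and diagnosis keywords are illustrative assumptions, not the actual report schema used in the study.

```python
# Illustrative sketch of assigning each de-identified report to a study group.
# Field names and keyword matching are assumptions for demonstration only.
def categorize(diagnosis: str) -> str:
    """Map a free-text radiologist diagnosis to one of the three study groups."""
    text = diagnosis.lower()
    if "myocardial infarction" in text:
        return "MI"
    if "dilated cardiomyopathy" in text:
        return "DCM"
    if "hypertrophic cardiomyopathy" in text:
        return "HCM"
    return "other"  # mixed or unrelated diagnoses are excluded elsewhere

# A de-identified record keeps only gender, imaging description, and diagnosis.
reports = [
    {"gender": "M", "diagnosis": "Old myocardial infarction of the inferior wall"},
    {"gender": "F", "diagnosis": "Dilated cardiomyopathy with reduced ejection fraction"},
    {"gender": "M", "diagnosis": "Hypertrophic cardiomyopathy, asymmetric septal type"},
]
groups = [categorize(r["diagnosis"]) for r in reports]
print(groups)  # → ['MI', 'DCM', 'HCM']
```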
Evaluation and prompting strategy
The gender of each patient, the textual content of the CMR imaging descriptions, and the request to determine the diagnosis of the cardiac disease categories were directly input into the original models of six LLMs: GPT-3.5 (OpenAI), GPT-4.0 (OpenAI), Gemini-1.5 (Google), Gemini-1.0 (Google), PaLM (Google), and LLaMA (Meta). This approach established the LLMs + minimal prompt models. Subsequently, to obtain outputs for specific scenarios of CMR interpretations for three types of cardiac diseases, diagnostic prompts related to CMR for each disease were provided before each diagnostic classification query. This approach established the LLMs + informative prompt models. The informative prompt content was in accordance with the Fourth Universal Definition of Myocardial Infarction [2018] (15) and the 2023 ESC guidelines for the management of cardiomyopathies (16). The definitions were as follows: (I) For MI, a high signal originating subendocardially and located within the coronary artery supply territory was observed on late gadolinium enhancement imaging of CMR, with or without associated segmental wall motion abnormalities. (II) For DCM, left ventricular dilatation and systolic dysfunction were present, which cannot be attributed solely to abnormal loading conditions or coronary artery disease. In adults, left ventricular dilatation was defined as an end-diastolic diameter >58 mm in males or >52 mm in females. Overall left ventricular systolic dysfunction was defined as a left ventricular ejection fraction <50%. (III) For HCM, any myocardial segment with a left ventricular wall thickness of ≥15 mm at end-diastole, which could not be solely explained by abnormal loading, was considered hypertrophic. Examples of input and output data for both the minimal prompted and informative prompted LLMs can be found in Appendix 2. The LLMs did not have access to the radiologist-diagnosed cardiac disease categories from the CMR reports.
The chat session was reset after each report was processed and a response was generated. All minimal and informative prompted LLMs were tested between August 1, 2024, and August 20, 2024. To assess the reproducibility of the LLMs, 30 reports from each cardiac disease category were randomly selected and re-entered into the LLMs after a 30-day interval.
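The two querying modes described above might be assembled as in the sketch below; the exact prompt wording and the condensed criteria text are paraphrased assumptions (the actual templates are provided in Appendix 2).

```python
# Hypothetical sketch of the minimal vs. informative prompting modes.
# The wording is an assumption; the study's actual templates are in Appendix 2.

MINIMAL_PROMPT = (
    "Based on the following CMR report, classify the cardiac disease as "
    "myocardial infarction (MI), dilated cardiomyopathy (DCM), or "
    "hypertrophic cardiomyopathy (HCM).\n"
    "Gender: {gender}\nFindings: {findings}"
)

# Diagnostic criteria condensed from the Fourth Universal Definition of MI
# and the 2023 ESC cardiomyopathy guidelines, prepended in informative mode.
CRITERIA = (
    "MI: subendocardial late gadolinium enhancement within a coronary supply "
    "territory, with or without segmental wall motion abnormality. "
    "DCM: LV dilatation (end-diastolic diameter >58 mm in males, >52 mm in "
    "females) with LVEF <50%, not explained by loading conditions or coronary "
    "disease. HCM: end-diastolic LV wall thickness >=15 mm in any segment, "
    "not explained by abnormal loading."
)

def build_prompt(gender: str, findings: str, informative: bool) -> str:
    """Return the query for one report, with or without the criteria prefix."""
    query = MINIMAL_PROMPT.format(gender=gender, findings=findings)
    return f"{CRITERIA}\n\n{query}" if informative else query

print(build_prompt("Male", "Subendocardial LGE in the LAD territory.", informative=True))
```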
Statistical analysis
Accuracy (ACC), balanced accuracy (BAC), and confusion matrix visuals were utilized to evaluate the classification diagnostic performance of the LLMs. BAC is the average of the recall obtained on each class. For a multi-class classification problem with $K$ classes, it is defined as:

\[ \mathrm{BAC} = \frac{1}{K}\sum_{k=1}^{K}\frac{TP_k}{TP_k + FN_k} \]

where $TP_k$ is the number of true positives for class $k$ and $FN_k$ is the number of false negatives for class $k$; each term $TP_k/(TP_k+FN_k)$ is the recall of class $k$. In other words, BAC is the mean of each class’s recall.
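The definition above can be written out directly as a short, pure-Python sketch (the labels are illustrative); it is equivalent to `sklearn.metrics.balanced_accuracy_score`.

```python
# Minimal sketch of balanced accuracy: the mean of per-class recall,
# recall_k = TP_k / (TP_k + FN_k). Example labels are illustrative.
def balanced_accuracy(y_true, y_pred):
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        recalls.append(tp / (tp + fn))  # recall of class c
    return sum(recalls) / len(classes)

y_true = ["MI", "MI", "MI", "DCM", "DCM", "HCM"]
y_pred = ["MI", "MI", "DCM", "DCM", "HCM", "HCM"]
print(round(balanced_accuracy(y_true, y_pred), 3))  # → 0.722
```

Note that with class imbalance, plain accuracy here would be 4/6 ≈ 0.667, while BAC weights each class equally regardless of its prevalence.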
The Gwet agreement coefficient (AC1) values and their 95% confidence intervals (CIs) (17,18) were calculated in Python 3.11 (Python Software Foundation, Wilmington, DE, USA) using the sklearn library to assess the agreement between radiologists and LLMs in categorizing cardiac diseases. Agreement levels were interpreted based on the Landis and Koch scale (19): 0–0.20 indicated slight agreement; 0.21–0.40 indicated fair agreement; 0.41–0.60 indicated moderate agreement; 0.61–0.80 indicated substantial agreement; 0.81–0.99 indicated almost perfect agreement; 1.00 indicated perfect agreement. The McNemar test was employed to compare the number of correct diagnoses between the minimal prompted and the informative prompted models of LLMs, as well as to evaluate the discordance between radiologists and LLMs. The diagnostic efficacy of different LLMs in categorizing CMR reports was further assessed by the areas under the receiver operating characteristic (ROC) curves (AUCs) along with their 95% CIs. Reproducibility of the diagnostic classification results of CMR reports obtained by LLMs over different times was examined by Cohen’s kappa analysis in SPSS 26.0 (IBM, Armonk, NY, USA). All statistical tests in this study were two-sided, with statistical significance set at P<0.05.
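For two raters, AC1 can be written out directly from Gwet’s definition (17); the sketch below is an independent illustration under assumed example ratings, not the study’s actual code.

```python
# Two-rater Gwet AC1, following the original definition (Gwet, 2008):
# AC1 = (pa - pe) / (1 - pe), where pa is observed agreement and
# pe = 1/(K-1) * sum_k pi_k * (1 - pi_k), with pi_k the mean marginal
# proportion of category k across the two raters. Ratings are illustrative.
def gwet_ac1(ratings_a, ratings_b):
    n = len(ratings_a)
    cats = sorted(set(ratings_a) | set(ratings_b))
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    pe = 0.0
    for c in cats:
        pi = (ratings_a.count(c) + ratings_b.count(c)) / (2 * n)
        pe += pi * (1 - pi)
    pe /= len(cats) - 1
    return (pa - pe) / (1 - pe)

radiologist = ["MI", "MI", "DCM", "HCM", "MI", "DCM"]
model       = ["MI", "MI", "DCM", "HCM", "DCM", "DCM"]
print(round(gwet_ac1(radiologist, model), 3))  # → 0.758
```

Unlike Cohen’s kappa, AC1 stays stable when category prevalences are highly unbalanced, which is why it suits a cohort where MI accounts for half of the cases.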
Results
Study sample
As shown in the study flowchart (Figure 1), among the 1,505 patients who underwent CMR examinations, 602 were clearly diagnosed with MI, HCM, or DCM. After excluding 13 cases with incomplete imaging reports and 46 cases with mixed diagnoses, a total of 543 CMR cases were finally enrolled, with 393 males (72.4%). Among these, there were 275 cases of MI, with 208 males (75.6%); 120 cases of DCM, with 93 males (77.5%); and 148 cases of HCM, with 92 males (62.2%).
Accuracy of LLMs for CMR interpretations
The number of correctly classified diagnoses and both ACC and BAC for the minimal prompted and informative prompted LLMs across MI, DCM, HCM, and overall are presented in Figure 2. Considering the class imbalance in the dataset (MI: 50.6%, DCM: 22.1%, HCM: 27.3%), BAC was included as a key metric. Among minimal prompted models, GPT-4.0 + minimal prompt achieved the highest overall ACC and BAC (88.6%, 91.7%), followed by LLaMA + minimal prompt (86.6%, 90.0%), GPT-3.5 + minimal prompt (85.8%, 85.3%), PaLM + minimal prompt (84.5%, 86.5%), Gemini-1.5 + minimal prompt (81.4%, 84.5%), and Gemini-1.0 + minimal prompt (82.5%, 79.3%). Informative prompted models consistently outperformed their minimal prompted counterparts in both metrics. GPT-4.0 + informative prompt reached the highest ACC and BAC (95.8%, 97.1%), followed by GPT-3.5 + informative prompt (95.4%, 94.4%), Gemini-1.0 + informative prompt (93.7%, 94.8%), PaLM + informative prompt (91.2%, 93.7%), LLaMA + informative prompt (88.0%, 91.7%), and Gemini-1.5 + informative prompt (86.9%, 90.7%). Significant improvements (P<0.05) in ACC were observed for the informative prompted models of GPT-3.5, GPT-4.0, Gemini-1.0, Gemini-1.5, and PaLM. However, LLaMA + informative prompt showed no statistically significant gain (P=0.06). The confusion matrices for each model’s diagnostic predictions across the three cardiac conditions and the entire cohort are shown in Figure 3.
Agreement assessment between LLMs and radiologists
The agreement between the diagnostic results of LLMs and radiologists is illustrated in Table 1. Among the minimal prompted LLMs, GPT-4.0 + minimal prompt (AC1=0.82, 95% CI: 0.78–0.86) demonstrated almost perfect agreement with the radiologists’ CMR report diagnoses, while LLaMA + minimal prompt (AC1=0.79, 95% CI: 0.75–0.83), GPT-3.5 + minimal prompt (AC1=0.77, 95% CI: 0.73–0.82), PaLM + minimal prompt (AC1=0.76, 95% CI: 0.71–0.81), Gemini-1.0 + minimal prompt (AC1=0.71, 95% CI: 0.66–0.76), and Gemini-1.5 + minimal prompt (AC1=0.71, 95% CI: 0.67–0.76) showed substantial agreement. The agreement of GPT-4.0 + minimal prompt was significantly higher than that of the other models (P<0.001). For the informative prompted models of LLMs, GPT-4.0 + informative prompt (AC1=0.93, 95% CI: 0.90–0.96), GPT-3.5 + informative prompt (AC1=0.93, 95% CI: 0.90–0.95), Gemini-1.0 + informative prompt (AC1=0.90, 95% CI: 0.87–0.93), PaLM + informative prompt (AC1=0.86, 95% CI: 0.82–0.90), LLaMA + informative prompt (AC1=0.82, 95% CI: 0.78–0.86), and Gemini-1.5 + informative prompt (AC1=0.80, 95% CI: 0.76–0.84) all showed almost perfect agreement with radiologists’ diagnoses. The agreement of GPT-4.0 + informative prompt and GPT-3.5 + informative prompt was significantly higher than that of the other models (P<0.001).
Table 1
| Comparison | Gwet AC1 (95% CI) | P value |
|---|---|---|
| Minimal prompted LLMs | | <0.001* |
| Human-GPT-3.5 + minimal prompt | 0.77 (0.73, 0.82) | |
| Human-GPT-4.0 + minimal prompt | 0.82 (0.78, 0.86) | |
| Human-Gemini-1.0 + minimal prompt | 0.71 (0.66, 0.76) | |
| Human-Gemini-1.5 + minimal prompt | 0.71 (0.67, 0.76) | |
| Human-PaLM + minimal prompt | 0.76 (0.71, 0.81) | |
| Human-LLaMA + minimal prompt | 0.79 (0.75, 0.83) | |
| Informative prompted LLMs | | <0.001* |
| Human-GPT-3.5 + informative prompt | 0.93 (0.90, 0.95) | |
| Human-GPT-4.0 + informative prompt | 0.93 (0.90, 0.96) | |
| Human-Gemini-1.0 + informative prompt | 0.90 (0.87, 0.93) | |
| Human-Gemini-1.5 + informative prompt | 0.80 (0.76, 0.84) | |
| Human-PaLM + informative prompt | 0.86 (0.82, 0.90) | |
| Human-LLaMA + informative prompt | 0.82 (0.78, 0.86) | |
*, statistically significant at P<0.05. AC1, agreement coefficient; CI, confidence interval; CMR, cardiac magnetic resonance; LLMs, large language models.
Evaluation of the diagnostic performance of LLMs
The diagnostic classification performance of both the minimal prompted and informative prompted LLMs is illustrated in Figure 4. All minimal prompted LLMs exhibited good diagnostic performance. GPT-4.0 + minimal prompt achieved the highest overall AUC of 0.93 (95% CI: 0.92–0.95), followed by LLaMA + minimal prompt (AUC =0.92, 95% CI: 0.90–0.94), PaLM + minimal prompt (AUC =0.89, 95% CI: 0.87–0.92), GPT-3.5 + minimal prompt (AUC =0.89, 95% CI: 0.87–0.91), Gemini-1.5 + minimal prompt (AUC =0.88, 95% CI: 0.86–0.90), and Gemini-1.0 + minimal prompt (AUC =0.84, 95% CI: 0.82–0.87). In the MI subgroup, GPT-4.0 + minimal prompt and GPT-3.5 + minimal prompt both achieved an AUC of 0.90, while Gemini-1.0 + minimal prompt had the lowest AUC at 0.83. For DCM, GPT-4.0 + minimal prompt again demonstrated the strongest performance with an AUC of 0.94, and Gemini-1.0 + minimal prompt also showed a relatively lower AUC of 0.79. In HCM, the diagnostic performance of all models was generally high. GPT-4.0 + minimal prompt and LLaMA + minimal prompt both reached an AUC of 0.97, PaLM + minimal prompt achieved 0.96, while GPT-3.5 + minimal prompt achieved 0.89, and both Gemini-1.5 + minimal prompt and Gemini-1.0 + minimal prompt achieved 0.92. All informative prompted models of LLMs achieved excellent diagnostic performance. GPT-4.0 + informative prompt achieved the highest overall AUC of 0.98 (95% CI: 0.97–0.99), followed by GPT-3.5 + informative prompt (AUC =0.96, 95% CI: 0.95–0.98), Gemini-1.0 + informative prompt (AUC =0.96, 95% CI: 0.95–0.97), PaLM + informative prompt (AUC =0.95, 95% CI: 0.94–0.96), LLaMA + informative prompt (AUC =0.93, 95% CI: 0.92–0.95), and Gemini-1.5 + informative prompt (AUC =0.92, 95% CI: 0.91–0.94). In the MI subgroup, GPT-3.5 + informative prompt achieved the highest AUC of 0.98, followed closely by GPT-4.0 + informative prompt (0.96), while Gemini-1.5 + informative prompt showed the lowest performance at 0.87. 
For DCM, GPT-4.0 + informative prompt again demonstrated the best performance with an AUC of 0.98, and Gemini-1.0 + informative prompt and PaLM + informative prompt also performed well (both 0.95), whereas LLaMA + informative prompt and Gemini-1.5 + informative prompt showed lower values (both 0.93). In the HCM subgroup, all models performed strongly. GPT-4.0 + informative prompt achieved the highest AUC of 0.99, followed by Gemini-1.0 + informative prompt, PaLM + informative prompt and LLaMA + informative prompt (all 0.98), while GPT-3.5 + informative prompt showed a relatively lower value of 0.93.
Reproducibility evaluation of LLMs
The reproducibility evaluation of LLMs is presented in Table 2. Both the minimal prompted models and informative prompted models of the six LLMs exhibited strong reproducibility in their classification and diagnosis of CMR reports over different periods. The κ values of GPT-3.5 + minimal prompt, GPT-4.0 + minimal prompt, Gemini-1.0 + minimal prompt, Gemini-1.5 + minimal prompt, PaLM + minimal prompt, and LLaMA + minimal prompt were 0.81, 0.95, 0.82, 0.98, 0.98, and 0.92, respectively (P<0.001). The κ values of the prompted models—GPT-3.5 + informative prompt, GPT-4.0 + informative prompt, Gemini-1.0 + informative prompt, Gemini-1.5 + informative prompt, PaLM + informative prompt, and LLaMA + informative prompt—were 0.83, 0.87, 0.97, 0.98, 0.98, and 0.97, respectively (P<0.001).
Table 2
| Comparison | κ value | P value |
|---|---|---|
| GPT-3.5 + minimal prompt | 0.81 | <0.001* |
| GPT-4.0 + minimal prompt | 0.95 | <0.001* |
| Gemini-1.0 + minimal prompt | 0.82 | <0.001* |
| Gemini-1.5 + minimal prompt | 0.98 | <0.001* |
| PaLM + minimal prompt | 0.98 | <0.001* |
| LLaMA + minimal prompt | 0.92 | <0.001* |
| GPT-3.5 + informative prompt | 0.83 | <0.001* |
| GPT-4.0 + informative prompt | 0.87 | <0.001* |
| Gemini-1.0 + informative prompt | 0.97 | <0.001* |
| Gemini-1.5 + informative prompt | 0.98 | <0.001* |
| PaLM + informative prompt | 0.98 | <0.001* |
| LLaMA + informative prompt | 0.97 | <0.001* |
*, statistically significant at P<0.05. LLMs, large language models.
Discussion
This study is the first to demonstrate the remarkable potential of LLMs for the automatic classification and diagnosis of CMR reports, particularly in diagnosing three common cardiac diseases: MI, DCM, and HCM. Among the six LLMs evaluated, GPT-4.0 exhibited the highest diagnostic performance, especially when enhanced with informative prompts. The informative prompted models, including GPT-4.0 + informative prompt, GPT-3.5 + informative prompt, Gemini-1.0 + informative prompt, Gemini-1.5 + informative prompt, and PaLM + informative prompt, showed significant improvements in diagnostic ACC compared to their minimal prompted counterparts. Notably, all six LLMs, in both their minimal prompted and informative prompted models, maintained strong reproducibility over time.
The potential of LLMs in medical applications has garnered increasing interest due to their advanced text comprehension and generation capabilities, which have proven effective in various radiological tasks (20,21). Le Guellec et al. demonstrated that the local open-source LLM, Vicuna, could perform information extraction from free-text radiology reports without additional training (22). Multiple recent studies have demonstrated that LLMs could reliably extract actionable information from radiology reports and effectively integrate complex medical data (23-25). Van Veen et al. confirmed that LLMs outperformed medical experts in clinical text summarization tasks, including radiology reports, thus reducing the documentation burdens in clinical workflows (9). Lehnen et al. highlighted the effectiveness of ChatGPT in extracting data from free-text reports of mechanical thrombectomy for acute ischemic stroke (6). Additionally, Salam et al. showed that GPT-4.0 could reliably convert complex CMR reports into layperson-friendly language while largely maintaining factual accuracy and completeness (26). In our study, LLMs demonstrated high diagnostic efficacy in classifying CMR reports even without specialized training, which was consistent with findings from other clinical tasks (9,22,26). The ACC and AUC of the LLMs showed consistent performance, with the overall ACC of all LLMs exceeding 80% and the AUC surpassing 0.80. Among the minimal prompted LLMs we evaluated, GPT-4.0 + minimal prompt performed exceptionally well, outperforming the other LLMs. Specifically, GPT-4.0 + minimal prompt achieved an ACC of 80.0% for MI, 99.2% for DCM, and 95.9% for HCM, with an overall ACC of 88.6% and an AUC of 0.93.
Compared with minimal prompted LLMs, informative prompted LLMs generally demonstrated significant improvements in diagnostic performance. Among the informative prompted LLMs, GPT-4.0 + informative prompt maintained the best performance, achieving an ACC of 92.0% for MI, 100.0% for DCM, and 99.3% for HCM, with an overall ACC of 95.8% and an AUC of 0.98. Other informative prompted models, including GPT-3.5 + informative prompt, Gemini-1.0 + informative prompt, Gemini-1.5 + informative prompt, and PaLM + informative prompt, also showed significant improvements, with ACCs substantially higher than those of their minimal prompted counterparts. These findings underscore the strong contextual learning abilities of LLMs and highlight the value of incorporating domain-specific prompts to improve diagnostic ACC. Rau et al. similarly noted that a chatbot equipped with appropriateness criteria knowledge outperformed generic versions of ChatGPT and radiologists (27). This suggests that prompts enriched with specific medical knowledge could enhance performance in medical imaging tasks, thereby narrowing the gap between AI and expert diagnoses. Moreover, informative prompts not only improved diagnostic accuracy but also enhanced model interpretability (28). While the “black-box” nature of LLMs remains a challenge in clinical settings (28-31), embedding clear medical standards within prompts could mitigate this issue and increase trust among physicians in AI-driven diagnostics. It is worth noting that the ACC and AUC of Gemini-1.5 + informative prompt were lower than those of Gemini-1.0 + informative prompt. This discrepancy could be due to factors such as model complexity, overfitting, or differences in training strategies (28,32). Additionally, the diagnostic improvement of LLaMA + informative prompt was less pronounced, which might indicate that its architecture or training data are less amenable to informative prompt-based enhancement (32,33).
The persistence of erroneous diagnoses in a subset of cases even when supplemented with informational prompts suggests that LLM training protocols should incorporate comprehensive instruction on pathognomonic imaging feature weighting and diagnostic reasoning frameworks, rather than relying solely on threshold-based diagnostic classification. Future research should explore alternative prompt strategies or model adjustments to address this limitation.
The agreement with radiologists’ diagnoses is a critical measurement of the effectiveness of AI models in clinical applications. In this study, GPT-4.0 exhibited the highest agreement with radiologist diagnoses among the minimal prompted LLMs, achieving an AC1 value of 0.82, which indicated almost perfect agreement and was consistent with its high ACC. The introduction of informative prompts further improved this agreement. Both GPT-4.0 + informative prompt and GPT-3.5 + informative prompt reached an AC1 value of 0.93, reflecting almost perfect agreement with radiologist diagnoses and aligning with the findings reported by Gertz et al. (34). Reproducibility is another essential criterion in evaluating the reliability of AI diagnostics. Our study demonstrated that all LLMs, including both minimal prompted and informative prompted models, showed strong reproducibility, with κ values exceeding 0.80. This suggests that LLMs could offer stable and reliable diagnostic capabilities (8), potentially reducing variability due to human interpretation and improving the overall reliability and efficiency of diagnostic workflows. These findings further underscore their potential utility in clinical practice.
Limitations
While the results of this study are encouraging, several challenges remain for the practical implementation of LLMs in clinical settings. First, following the methodological rationale of radiologist training—mastering typical cases before tackling complex ones—this single-center study validated the diagnostic performance of LLMs only for three common cardiac diseases. Future research should expand the dataset to encompass a wider range of cardiac conditions and complex clinical scenarios involving comorbidities, while also accounting for the variability in CMR reports across different institutions. Moreover, we plan to integrate LLMs with multimodal data, including imaging analysis and clinical information, to explore their broader potential in intelligent radiological diagnostics.
Conclusions
LLMs demonstrated promising performance in the preliminary automatic classification and diagnosis of CMR reports for MI, DCM, and HCM, particularly when guided by designed prompts. Their notable agreement with radiologists’ diagnoses suggests potential as supportive tools in intelligent imaging diagnostics. While further validation across broader and more diverse datasets is needed, these models may eventually assist clinical decision-making and contribute to improving the diagnosis workflow and management of cardiovascular diseases.
Acknowledgments
Professor Liping He kindly provided statistical advice for this manuscript. We would like to thank Dr. Howard Chan Tsai Hor for his help in polishing our paper.
Footnote
Reporting Checklist: The authors have completed the STARD reporting checklist. Available at https://cdt.amegroups.com/article/view/10.21037/cdt-2025-112/rc
Data Sharing Statement: Available at https://cdt.amegroups.com/article/view/10.21037/cdt-2025-112/dss
Peer Review File: Available at https://cdt.amegroups.com/article/view/10.21037/cdt-2025-112/prf
Funding: This study was supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://cdt.amegroups.com/article/view/10.21037/cdt-2025-112/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of The Second Affiliated Hospital of Kunming Medical University (No. YJ2024112) and individual consent for this retrospective analysis was waived.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Hundley WG. Fifty Years of Cardiovascular Magnetic Resonance: Continuing Evolution Toward the "One-Stop Shop" for Cardiovascular Diagnosis. Circulation 2024;149:1859-61. [Crossref] [PubMed]
- Pennell DJ, Mohiaddin RH. Cardiovascular Magnetic Resonance: Past, Present, and Future. Circ Cardiovasc Imaging 2024;17:e016523. [Crossref] [PubMed]
- Ibanez B, Aletras AH, Arai AE, et al. Cardiac MRI Endpoints in Myocardial Infarction Experimental and Clinical Trials: JACC Scientific Expert Panel. J Am Coll Cardiol 2019;74:238-56. [Crossref] [PubMed]
- Wang YJ, Yang K, Wen Y, et al. Screening and diagnosis of cardiovascular disease using artificial intelligence-enabled cardiac magnetic resonance imaging. Nat Med 2024;30:1471-80. [Crossref] [PubMed]
- Gertz RJ, Bunck AC, Lennartz S, et al. GPT-4 for Automated Determination of Radiological Study and Protocol based on Radiology Request Forms: A Feasibility Study. Radiology 2023;307:e230877. [Crossref] [PubMed]
- Lehnen NC, Dorn F, Wiest IC, et al. Data Extraction from Free-Text Reports on Mechanical Thrombectomy in Acute Ischemic Stroke Using ChatGPT: A Retrospective Analysis. Radiology 2024;311:e232741. [Crossref] [PubMed]
- Li D, Gupta K, Bhaduri M, et al. Comparing GPT-3.5 and GPT-4 Accuracy and Drift in Radiology Diagnosis Please Cases. Radiology 2024;310:e232411. [Crossref] [PubMed]
- Cozzi A, Pinker K, Hidber A, et al. BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study. Radiology 2024;311:e232133. [Crossref] [PubMed]
- Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med 2024;30:1134-42. [Crossref] [PubMed]
- Savage CH, Kanhere A, Parekh V, et al. Open-Source Large Language Models in Radiology: A Review and Tutorial for Practical Research and Clinical Deployment. Radiology 2025;314:e241073. [Crossref] [PubMed]
- Vaduganathan M, Mensah GA, Turco JV, et al. The Global Burden of Cardiovascular Diseases and Risk: A Compass for Future Health. J Am Coll Cardiol 2022;80:2361-71. [Crossref] [PubMed]
- Maron BJ, Desai MY, Nishimura RA, et al. Diagnosis and Evaluation of Hypertrophic Cardiomyopathy: JACC State-of-the-Art Review. J Am Coll Cardiol 2022;79:372-89. [Crossref] [PubMed]
- Heymans S, Lakdawala NK, Tschöpe C, et al. Dilated cardiomyopathy: causes, mechanisms, and current and future treatment approaches. Lancet 2023;402:998-1011. [Crossref] [PubMed]
- Hundley WG, Bluemke DA, Bogaert J, et al. Society for Cardiovascular Magnetic Resonance (SCMR) guidelines for reporting cardiovascular magnetic resonance examinations. J Cardiovasc Magn Reson 2022;24:29. [Crossref] [PubMed]
- Thygesen K, Alpert JS, Jaffe AS, et al. Fourth Universal Definition of Myocardial Infarction (2018). J Am Coll Cardiol 2018;72:2231-64. [Crossref] [PubMed]
- Arbelo E, Protonotarios A, Gimeno JR, et al. 2023 ESC Guidelines for the management of cardiomyopathies. Eur Heart J 2023;44:3503-626. [Crossref] [PubMed]
- Gwet KL. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol 2008;61:29-48. [Crossref] [PubMed]
- Klein D. Implementing a general framework for assessing interrater agreement in Stata. Stata J 2018;18:871-901.
- Kundel HL, Polansky M. Measurement of observer agreement. Radiology 2003;228:303-8. [Crossref] [PubMed]
- Bhayana R. Chatbots and Large Language Models in Radiology: A Practical Primer for Clinical and Research Applications. Radiology 2024;310:e232756. [Crossref] [PubMed]
- Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations. Radiology 2023;307:e230582. [Crossref] [PubMed]
- Le Guellec B, Lefèvre A, Geay C, et al. Performance of an Open-Source Large Language Model in Extracting Information from Free-Text Radiology Reports. Radiol Artif Intell 2024;6:e230364. [Crossref] [PubMed]
- Li CY, Chang KJ, Yang CF, et al. Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation. Nat Commun 2025;16:2258. [Crossref] [PubMed]
- Bhayana R, Jajodia A, Chawla T, et al. Accuracy of Large Language Model-based Automatic Calculation of Ovarian-Adnexal Reporting and Data System MRI Scores from Pelvic MRI Reports. Radiology 2025;315:e241554. [Crossref] [PubMed]
- Bluethgen C, Van Veen D, Zakka C, et al. Best Practices for Large Language Models in Radiology. Radiology 2025;315:e240528. [Crossref] [PubMed]
- Salam B, Kravchenko D, Nowak S, et al. Generative Pre-trained Transformer 4 makes cardiovascular magnetic resonance reports easy to understand. J Cardiovasc Magn Reson 2024;26:101035. [Crossref] [PubMed]
- Rau A, Rau S, Zoeller D, et al. A Context-based Chatbot Surpasses Trained Radiologists and Generic ChatGPT in Following the ACR Appropriateness Guidelines. Radiology 2023;308:e230970. [Crossref] [PubMed]
- Cheng J, Liu X, Zheng K, et al. Black-Box Prompt Optimization: Aligning Large Language Models without Model Training. arXiv:2311.04155, 2023.
- Schwartz IS, Link KE, Daneshjou R, et al. Black Box Warning: Large Language Models and the Future of Infectious Diseases Consultation. Clin Infect Dis 2024;78:860-6. [Crossref] [PubMed]
- Zini J, Awad M. On the Explainability of Natural Language Processing Deep Models. ACM Comput Surv 2022;55:1-31.
- Luo S, Ivison H, Han S, et al. Local Interpretations for Explainable Natural Language Processing: A Survey. ACM Comput Surv 2024;56:1-36.
- Zhou L, Schellaert W, Martínez-Plumed F, et al. Larger and more instructable language models become less reliable. Nature 2024;634:61-8. [Crossref] [PubMed]
- Kresevic S, Giuffrè M, Ajcevic M, et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. NPJ Digit Med 2024;7:102. [Crossref] [PubMed]
- Gertz RJ, Dratsch T, Bunck AC, et al. Potential of GPT-4 for Detecting Errors in Radiology Reports: Implications for Reporting Accuracy. Radiology 2024;311:e232714. [Crossref] [PubMed]

