Background
Hepatocellular carcinoma (HCC) management requires complex decision-making that weighs tumor burden, liver function, and the patient’s performance status. Large language models (LLMs) show promise in clinical applications, but their utility in HCC treatment recommendations remains unexplored. We evaluated the clinical relevance of LLM-generated treatment recommendations by assessing their concordance with real-world physician decisions and the association of that concordance with survival outcomes.
Methods and findings
We analyzed 13,614 treatment-naive HCC patients diagnosed between 2008 and 2020 in the Korean Primary Liver Cancer Registry. Treatment recommendations were generated using ChatGPT 4o, Gemini 2.0, and Claude 3.5 with standardized prompts referencing the American Association for the Study of Liver Diseases and the European Association for the Study of the Liver guidelines. Patients were classified as “matched” when LLM recommendations corresponded to actual treatments received. Overall survival (OS) was compared between matched and mismatched groups, stratified by the Barcelona Clinic Liver Cancer (BCLC) stage. Decision tree analysis identified factors influencing treatment selection patterns. Concordance rates between LLM recommendations and physician decisions were 31.1% (ChatGPT 4o), 32.7% (Gemini 2.0), and 26.8% (Claude 3.5). In BCLC-A patients, treatment concordance with LLM recommendations was associated with significantly improved survival (ChatGPT 4o HR: 0.743, 95% CI [0.665, 0.831], P < 0.001). Conversely, in BCLC-C patients, concordance was associated with worse survival outcomes (ChatGPT 4o HR: 1.650, 95% CI [1.523, 1.787], P < 0.001; Gemini 2.0 HR: 1.586, 95% CI [1.470, 1.711], P < 0.001; Claude 3.5 HR: 1.483, 95% CI [1.366, 1.610], P < 0.001). In BCLC-B, concordance showed only modest or nonsignificant associations with survival across models. Decision tree analysis revealed that physicians prioritized liver function parameters, while LLMs emphasized tumor characteristics. In early-stage HCC, physicians avoided curative treatments when hepatic reserve was limited, whereas in advanced-stage HCC, physicians preferred locoregional therapies in patients with preserved liver function despite guideline recommendations for systemic therapy. This study is limited by its retrospective design, reliance on registry data without imaging information, and focus on guideline-era treatments, warranting future prospective validation.
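The matched/mismatched classification and the concordance rates reported above can be illustrated with a minimal sketch. It assumes that "matched" means exact equality between a recommended and a received treatment label; the study's actual matching rule may have been broader. The function names, treatment labels, and toy records are hypothetical and are not drawn from the registry.

```python
from collections import Counter


def concordance_rate(recommended, received):
    """Fraction of patients whose received treatment equals the recommendation."""
    if len(recommended) != len(received):
        raise ValueError("recommended and received must have the same length")
    if not recommended:
        raise ValueError("at least one patient is required")
    matched = sum(rec == act for rec, act in zip(recommended, received))
    return matched / len(recommended)


def concordance_by_stage(stages, recommended, received):
    """Concordance rate computed separately within each stage label."""
    totals = Counter()
    matches = Counter()
    for stage, rec, act in zip(stages, recommended, received):
        totals[stage] += 1
        matches[stage] += int(rec == act)
    return {stage: matches[stage] / totals[stage] for stage in totals}


# Hypothetical toy data (illustration only, not study data).
stages = ["A", "A", "B", "C", "C"]
recommended = ["resection", "ablation", "TACE", "systemic", "systemic"]
received = ["resection", "TACE", "TACE", "TACE", "systemic"]

overall = concordance_rate(recommended, received)
by_stage = concordance_by_stage(stages, recommended, received)
```

In this sketch the matched flag is simply `rec == act`; a survival comparison between matched and mismatched groups would then use that flag as the grouping variable within each BCLC stage.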
Conclusions
Concordance between LLM-generated and physician treatment decisions was associated with improved survival in early-stage HCC, whereas in advanced-stage disease it was associated with worse survival. While LLMs may serve as adjunctive tools for guideline-concordant decisions in straightforward scenarios, their recommendations may reflect limited contextual awareness in complex clinical situations requiring individualized care. LLM recommendations should be interpreted cautiously alongside clinical judgment.