Background
Lung cancer is a global health threat, and early detection and precise staging are essential for improving patient outcomes. This study developed and validated a machine learning-based risk model for early lung cancer screening and staging using routine clinical data.
Methods
Two-center, observational, retrospective studies were conducted, involving 2312 patients with lung cancer and 653 patients with benign nodules. Differential analysis and feature selection were used to identify key factors for modeling. The variables examined included nodule density, carcinoembryonic antigen (CEA), age, and lifestyle habits. Based on the selected features, a logistic regression model was used for early diagnosis and an XGBoost model was used for staging.
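As an illustration of the kind of risk model described above, the following is a minimal from-scratch sketch of logistic regression fitted by gradient descent. It is not the authors' implementation: the feature encoding (scaled age, scaled CEA, a binary nodule-density flag), the toy records, the learning rate, and the function names `fit_logistic` and `predict_risk` are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # gradient of the log loss w.r.t. the linear score
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_risk(w, b, x):
    """Return the predicted probability of malignancy for one record."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical toy records (NOT study data):
# [age scaled to 0-1, CEA scaled to 0-1, solid nodule flag (0/1)]
X = [
    [0.2, 0.1, 0], [0.3, 0.2, 0], [0.4, 0.1, 0], [0.3, 0.4, 0],
    [0.5, 0.3, 1], [0.6, 0.7, 1], [0.7, 0.6, 1], [0.8, 0.9, 1],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = lung cancer, 0 = benign nodule

w, b = fit_logistic(X, y)
high = predict_risk(w, b, [0.8, 0.9, 1])
low = predict_risk(w, b, [0.2, 0.1, 0])
print(f"high-risk profile: {high:.3f}, low-risk profile: {low:.3f}")
```

In practice an established library implementation with regularization and proper validation would be preferable. The sketch only shows how selected clinical features could be combined into a single risk probability.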
Results
For early diagnosis, the logistic regression model achieved an area under the curve (AUC) of 0.716 (95% confidence interval [CI] 0.607–0.826), with a sensitivity of 0.703 and a specificity of 0.654. The XGBoost model distinguished late-stage from early-stage lung cancer with an AUC of 0.913 (95% CI 0.862–0.963), a sensitivity of 0.909, and a specificity of 0.814. These findings highlight the models’ potential for enhancing diagnostic accuracy and staging in lung cancer.
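For readers less familiar with the metrics reported above, the sketch below shows how AUC, sensitivity, and specificity can be computed from predicted risk scores. The labels, scores, the 0.5 threshold, and the function names are hypothetical. The source does not state how the operating threshold or the confidence intervals were obtained, so neither is reproduced here.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = fn = tn = fp = 0
    for l, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if l == 1 and predicted_positive:
            tp += 1
        elif l == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels and predicted risks (NOT study data)
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

auc = auc_score(labels, scores)
sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
print(f"AUC={auc:.3f}, sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the three values are reported together.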
Conclusion
This study introduces a novel machine learning-based risk model for early lung cancer screening and staging that leverages routine clinical information and laboratory data. The model shows promise for enhancing diagnostic accuracy, mitigating overdiagnosis, and improving patient outcomes.