FRF-HHO: Early ovarian cancer prediction using explainable fuzzy random forest optimized by Harris Hawks algorithm
Ovarian cancer remains one of the most lethal gynecological malignancies, largely due to delayed diagnosis and the absence of reliable early screening tools. This study proposes an interpretable machine learning framework that integrates Fuzzy Random Forest (FRF) with Harris Hawks Optimization (HHO) for early ovarian cancer prediction using routine clinical data. The analysis was conducted on a publicly available dataset comprising 349 patient records with 51 clinical and biochemical features. To mitigate overfitting and data leakage, preprocessing, Recursive Feature Elimination with Cross-Validation (RFECV), and SMOTE–Tomek balancing were applied exclusively within the training data. A total of 31 relevant biomarkers were selected for model development. The HHO-optimized FRF achieved an accuracy of 94.12%, precision of 91.43%, recall of 96.07%, and an F1-score of 93.69%, outperforming several baseline ensemble and gradient boosting models evaluated under identical experimental conditions. Model interpretability was supported by SHAP and LIME analyses, which consistently identified AFP, HE4, CA125, and Age as influential predictors, in line with established clinical knowledge. The high recall indicates strong sensitivity to cancer cases, an essential requirement for diagnostic support. Despite the encouraging performance, the study is limited by its moderate sample size and retrospective design; consequently, the findings should be interpreted as preliminary. Future work will focus on validation using larger, multi-center cohorts and prospective studies to assess generalizability and clinical scalability.
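As a quick consistency check on the reported metrics, the F1-score can be recomputed from the stated precision and recall using the standard harmonic-mean definition, F1 = 2PR / (P + R). The sketch below is illustrative only: the function name `f1_from_precision_recall` is a hypothetical helper, not part of the study's code, and the inputs are simply the percentages quoted in the abstract converted to fractions.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (fractions in [0, 1])."""
    if precision + recall == 0:
        # Convention: F1 is defined as 0 when both precision and recall are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract, converted from percentages to fractions.
reported_precision = 0.9143
reported_recall = 0.9607

f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.9369"
```

The recomputed value matches the reported F1-score of 93.69% to two decimal places, so the three figures are mutually consistent.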