UK Biobank Study
The UK Biobank is a large-scale, population-based cohort of participants aged 40-69 years, recruited between 2006 and 2010 at 22 UK assessment centers, with a planned follow-up of 20 years19. The dataset includes sociodemographic, psychosocial, lifestyle, family history, clinical, physical, cognitive, activity monitoring, biochemical, imaging, and genomic data, together with linkage to a wide range of electronic health records, from more than 500,000 participants.
At the initial visit, a touchscreen questionnaire and a computer-assisted interview were completed. Physical and functional measures and samples of blood, urine and saliva were collected. Baseline assessments were repeated on a smaller subset of the cohort (20,000–25,000 participants). The time interval between visits was approximately 4 years on average (range: 1-10 years).
All participants were registered with the UK National Health Service. Cancer outcomes were obtained from electronic health records, hospital episode statistics, the National Cancer Registry, self-reports validated by the study nurse, and death certification data. Outcomes were coded using the 10th revision of the International Classification of Diseases (ICD-10) system. The UK Biobank study was approved by the North West – Haydock Research Ethics Committee (21/NW/0157), and all experiments were performed in accordance with relevant guidelines and regulations. Participants provided informed consent for the storage and use of their data by bona fide researchers conducting health research for the public good.
The initial sample size was 502,411 participants. Of these, 387,773 (77.2%) were healthy controls (HC), defined as participants who were not diagnosed with cancer before or during the study. To define the CRC group, we used the ICD-10 codes for colonic malignancy, including cecum, appendix, ascending colon, hepatic flexure, transverse colon, splenic flexure, descending colon, sigmoid colon, overlapping colonic lesion, and unspecified colon (C18.0–9), rectosigmoid junction (C19) and rectum (C20). There were 2317 participants with CRC at baseline (0.46%) and 6237 incident cases (1.24%) at follow-up visits. Among incident cases, 6116 (98%) were diagnosed after the last study visit.
We included 72 sociodemographic, physical, medical, lifestyle, and biochemical measures in our analysis. Cancer-related variables were only used to describe the patient population and were not used as predictors in the analyses.
Age at screening visit, gender, ethnicity, Townsend deprivation index as an index of socioeconomic status, and highest educational qualification.
Height, body mass index (BMI), pulse, diastolic and systolic blood pressure (BP), waist-to-hip ratio, trunk-to-leg fat ratio, metabolic rate, impedance, grip strength of the dominant hand, and self-reported sleep duration in hours.
Family history of cancer and CRC, disease history for inflammatory bowel disease (IBD), cardiovascular disease (CVD), liver and biliary disease, and diabetes, self-reported general health rating (on a scale of 1 to 4, from excellent to poor), and regular use of aspirin and statins.
Frequency of smoking and alcohol consumption, frequency of eating oily fish, processed meat and red meat, and summed metabolic equivalent of task (MET) minutes per week for all activities, based on the International Physical Activity Questionnaire.
Full blood count (FBC) and biochemistry
Biochemical measures were evaluated in serum and urine. Serum markers consisted of white blood cell count (WBC), red blood cell count (RBC), hemoglobin concentration, hematocrit percentage, platelet count, lymphocyte percentage, apolipoprotein A and B, urea, cholesterol, C-reactive protein (CRP), cystatin C, high-density lipoprotein, insulin-like growth factor 1 (IGF-1), low-density lipoprotein, sex hormone-binding globulin (SHBG), testosterone, total protein, triglycerides, vitamin D, mean corpuscular volume, percentages of monocytes, neutrophils, eosinophils, basophils, nucleated red blood cells and reticulocytes, albumin, alkaline phosphatase (ALP), alanine aminotransferase (ALT), aspartate aminotransferase (AST), direct bilirubin, calcium, gamma-glutamyltransferase (GGT), glucose, glycated hemoglobin A1C (HbA1C), phosphate, total bilirubin and urate. An additional measure was included in the models as a covariate to indicate whether the participant had fasted prior to blood collection. The markers evaluated in urine were creatinine, potassium and sodium.
Cancer-related variables
Age at cancer diagnosis, cancer location, cancerous tumor behavior, distinct cancer diagnoses, and whether the participant had previously been screened for CRC.
All pre-processing and data analysis steps (Fig. 1) were performed with Python 3. Participants whose consent date was not available (n = 2), who withdrew consent (n = 158), or who had other primary cancers (n = 105,924) or concurrent cancers (n = 1,458) were excluded. We removed outliers outside the 0.1 and 0.99 percentiles. To simplify the interpretation of results, ethnicity was recoded as white and non-white, and education level as university and non-university. Summed MET minutes were grouped into quintiles. To maximize the data available in the analyses that follow, we opted for methods that can handle sparse data where possible and coded missing values in categorical measures as an 'unknown' category. We performed group comparisons on baseline data using two-tailed chi-square tests and independent-samples t-tests and corrected for multiple comparisons using the false discovery rate. Analysis-specific pre-processing steps are provided in the respective sections.
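The false discovery rate correction applied to the baseline group comparisons can be illustrated with a minimal sketch. The text does not name the specific procedure, so the Benjamini-Hochberg step-up method used here is an assumption, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the source only says "false discovery rate"; the
    Benjamini-Hochberg procedure is used here for illustration.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from four baseline group comparisons.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
significant = [q < 0.05 for q in adj]
print(adj, significant)
```

In this toy input only the first comparison remains below 0.05 after adjustment, which shows how the correction can change conclusions drawn from unadjusted tests.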
Time-to-event analysis
We used Cox regression, as implemented in the Lifelines package20, to assess the effect of the above biomarkers on age at diagnosis and the associated risk of CRC. In addition to the filtering steps described under pre-processing, we removed participants who had developed any type of cancer prior to recruitment (n = 6740). All participants were followed up to the censoring date, February 29, 2020. We calculated survival time based on participant age. Covariates in the model were measures from the initial screening visit. Because Cox regression cannot handle missing data, rows with missing values were removed.
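As an illustration of age-based survival time, the following sketch derives an (entry age, exit age, event) record for one participant. The function names and the example dates are hypothetical, and returning the age at recruitment as an entry age is an assumption about how such data might be prepared, not a description of the authors' code.

```python
from datetime import date

CENSORING_DATE = date(2020, 2, 29)  # censoring date stated in the text
DAYS_PER_YEAR = 365.25


def age_in_years(birth, when):
    """Age in years at a given date."""
    return (when - birth).days / DAYS_PER_YEAR


def survival_record(birth, recruitment, crc_diagnosis=None):
    """Return (entry_age, exit_age, event) for one participant.

    event is 1 if CRC was diagnosed on or before the censoring date;
    otherwise the participant is censored at the censoring date.
    """
    event = int(crc_diagnosis is not None and crc_diagnosis <= CENSORING_DATE)
    end = crc_diagnosis if event else CENSORING_DATE
    return age_in_years(birth, recruitment), age_in_years(birth, end), event


# Hypothetical participants: one incident case and one censored control.
case = survival_record(date(1950, 1, 1), date(2008, 1, 1), date(2015, 1, 1))
control = survival_record(date(1950, 1, 1), date(2008, 1, 1))
print(case, control)
```

Records of this shape could then be passed to a Cox model with the exit age as the duration and the event flag as the outcome.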
A data-driven approach was adopted to find the optimal set of covariates for the Cox regression. We first split the dataset into 80% training and 20% test sets, stratified by label. We then performed an initial feature selection in which each covariate was fitted univariately to the data, and any covariate with P > 0.10 was removed from the feature list. Next, we tested the remaining covariates for multicollinearity by calculating variance inflation factors (VIF) and removed any covariate with VIF > 10. Finally, we used backward feature elimination: we initially modeled all remaining variables and then iteratively removed the least significant variable (the one with the highest P value above 0.05) until all model covariates were significant. The final model was then evaluated on the test set. We report the concordance index (C-index) for the training and test sets as an index of model performance.
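The screening and backward elimination steps can be sketched as follows. This is a simplified illustration rather than the authors' implementation: `univariate_p` and `fit_pvalues` are hypothetical callables standing in for fitting a Cox model and reading off its P values, the toy P values are invented, and the VIF filter is left out for brevity.

```python
def univariate_screen(covariates, univariate_p, threshold=0.10):
    """Keep covariates whose univariate P value does not exceed the threshold."""
    return [c for c in covariates if univariate_p(c) <= threshold]


def backward_eliminate(covariates, fit_pvalues, alpha=0.05):
    """Refit and drop the least significant covariate until all have P <= alpha."""
    selected = list(covariates)
    while selected:
        pvals = fit_pvalues(selected)  # dict: covariate -> P value in joint model
        worst = max(selected, key=lambda c: pvals[c])
        if pvals[worst] <= alpha:
            break  # every remaining covariate is significant
        selected.remove(worst)
    return selected


# Hypothetical univariate P values (stand-in for one Cox fit per covariate).
toy_univariate = {"age": 0.001, "bmi": 0.05, "crp": 0.09, "urea": 0.10, "sodium": 0.45}


def toy_fit_pvalues(selected):
    """Stand-in for a multivariable fit; P values here are invented."""
    p = {"age": 0.001, "bmi": 0.20, "crp": 0.08, "urea": 0.60}
    if "urea" not in selected:
        p["crp"] = 0.03  # toy effect: crp becomes significant without urea
    return {c: p[c] for c in selected}


screened = univariate_screen(list(toy_univariate), toy_univariate.get)
final = backward_eliminate(screened, toy_fit_pvalues)
print(screened, final)
```

Refitting after every removal matters because P values of the remaining covariates can change once a correlated covariate is dropped, as the toy `crp` value shows.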
We used the Python implementation of GPBoost, which combines powerful tree-boosting algorithms with the mixed-effects models commonly used when working with clustered data21,22, to classify CRC vs HC using initial and repeat visits (when available). The tree-growing component uses LightGBM, a highly efficient type of gradient-boosting decision tree that handles missing values and works with categorical variables23.
Here we used the same feature processing and outlier removal procedures as before. Unlike in the survival analysis, we included both prevalent and incident cases and kept rows with missing values. We assigned a binary CRC label to each participant's study visit based on whether they had developed CRC by the time of the visit. We first split the dataset into 80% training and 20% test sets, and then split the training set into 5 folds for cross-validation (CV) in the feature selection experiments. We evaluated the performance of the final model using the area under the receiver operating characteristic curve (AUC-ROC).
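The visit-level labeling and the AUC-ROC metric can be illustrated with a short sketch. The function names, dates and scores below are hypothetical, and the pairwise AUC computation is a didactic version that would be too slow for a cohort of this size.

```python
from datetime import date


def visit_label(visit_date, diagnosis_date):
    """Return 1 if CRC had been diagnosed by the visit date, else 0."""
    return int(diagnosis_date is not None and diagnosis_date <= visit_date)


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a positive case outranks a negative one.

    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical participant diagnosed between the initial and the repeat visit.
diagnosis = date(2012, 6, 1)
labels_by_visit = [visit_label(v, diagnosis) for v in (date(2008, 3, 1), date(2013, 3, 1))]

# Hypothetical predicted probabilities for four visits.
auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(labels_by_visit, auc)
```

Because the same participant can contribute visits with different labels, visits from one participant are clustered, which is the situation the mixed-effects component of the classifier is meant to address.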
To determine the number of features, we implemented recursive feature elimination (RFE)24. RFE is a wrapper-type feature selection algorithm that recursively removes the least "important" features until the desired number of features is reached. To find a cost-effective combination of biomarkers, we performed a grid search with a maximum of 5 features, using the gain-based feature importance provided by our model. The ideal number of features (Nfeatures) was then selected as the number of features that achieved the highest 5-fold CV score. We also calculated feature importance values for each split, normalized by total importance and aggregated across splits. We generated partial dependence plots (PDP) to assess the dependence of the predicted CRC probability on each feature.
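A minimal sketch of RFE combined with a grid search over the number of features is given below. The callables `importance_fn` and `cv_score_fn` are hypothetical stand-ins for refitting the model to obtain gain-based importances and for computing the mean 5-fold CV score; the importance values and scores are invented for illustration.

```python
def rfe(features, importance_fn, n_keep):
    """Recursively drop the least important feature until n_keep remain."""
    selected = list(features)
    while len(selected) > n_keep:
        importances = importance_fn(selected)  # dict: feature -> importance
        selected.remove(min(selected, key=lambda f: importances[f]))
    return selected


def select_n_features(features, importance_fn, cv_score_fn, max_features=5):
    """Try 1..max_features features and keep the subset with the best CV score."""
    best = None
    for n in range(1, max_features + 1):
        subset = rfe(features, importance_fn, n)
        score = cv_score_fn(subset)
        if best is None or score > best[1]:
            best = (subset, score)
    return best


# Hypothetical importances and CV scores (stand-ins for model refits).
toy_importance = {"age": 0.50, "waist_hip": 0.20, "crp": 0.15,
                  "ggt": 0.10, "urea": 0.03, "sodium": 0.02}
toy_cv_by_size = {1: 0.62, 2: 0.66, 3: 0.67, 4: 0.665, 5: 0.66}

best_subset, best_score = select_n_features(
    list(toy_importance),
    lambda selected: {f: toy_importance[f] for f in selected},
    lambda subset: toy_cv_by_size[len(subset)],
)
print(best_subset, best_score)
```

In a real run the importances would be recomputed from a refitted model at each elimination step, so the ranking of the remaining features can change as features are removed.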