Background
Methods
Data
Variable abbreviation | Description and definition | Measurement units |
---|---|---|
Response variables | ||
HBsAg | Hepatitis B Surface Antigen (marker of HBV infection) | Positive (1) or Negative (0) |
HepC | Patient antibody to HCV, indicating contact with virus (both HBsAg and HepC detected by immunoassay) | |
Explanatory variables | ||
Age | Patient (case) Age | Years |
Sex | Gender 1 = F, 2 = M | M or F |
ALT | Alanine aminotransferase; an intracellular enzyme released after liver and other tissue cell damage | U/L |
GGT | Gamma-glutamyl transpeptidase; an intracellular enzyme also relevant to liver damage | U/L |
Hb | Haemoglobin | g/L |
Hct | Haematocrit; formerly known as “packed cell volume” | % |
MCH | Mean corpuscular haemoglobin | pg/RBC |
MCHC | Mean corpuscular haemoglobin concentration | g/L |
MCV | Mean corpuscular volume | fL |
Plt | Platelets; an agent in blood clotting | × 10^9/L |
WCC | White cell count | × 10^9/L |
RCC | Red cell count | × 10^12/L |
Crea | Creatinine; excreted by glomerular filtration and tubular secretion | μmol/L |
K | Potassium; predominant intracellular cation whose plasma level is regulated by renal excretion | mmol/L |
ALKP | Alkaline phosphatase; found in liver, bone and intestine | U/L |
ALB | Albumin; major component of plasma proteins | g/L |
TBil | Total bilirubin; reflects the rate at which the body recycles red blood cells (bilirubin is a breakdown product of old, spent red blood cells) | μmol/L |
Sodium | Sodium; predominant extracellular cation | mmol/L |
Urea | Blood urea; often used to detect kidney-related infections | mmol/L |
RDW | Red cell distribution width | % |
Neut | Neutrophils; white blood cells, elevated by bacterial infection and early viral infection | × 10^9/L |
Lymph | Lymphocytes; white blood cells, elevated by viral infection and some cancers | × 10^9/L |
Mono | Monocytes; white blood cells, elevated by infection, inflammation, and some cancers | × 10^9/L |
Eos | Eosinophils; white blood cells, elevated by allergy and parasite infection | × 10^9/L |
Bas | Basophils; white blood cells, elevated in hypersensitivity reactions | × 10^9/L |
Balancing
Scaling
Machine learning including predictor variable selection
HBV | |
Extract 9170 individuals with HBV recorded, of whom 172 are positive and 8998 negative | |
Split data into training (70%) and testing (30%), with 120 positive and 6300 negative in each training split | |
Either | Downsize the training data into 52 sets of 120 positive plus 120 negative |
Or | SMOTE the training data at 400% oversampling and 100% undersampling, leading to 52 sets of 3960 individuals with 1920 positive, 2040 negative |
Or | Multiply downsize the training data into 11 sets of 120 positive and 120 negative |
Then either | Grow a random forest, pick the top five variables, and apply SVM using those five variables |
Or | Proceed straight to SVM |
HCV | |
Extract 7820 individuals with HCV recorded, of whom 533 are positive and 7287 negative | |
Split data into training (70%) and testing (30%), with 373 positive and 5100 negative in each training split | |
Either | Downsize the training data into 13 sets of 373 positive, 373 negative |
Or | SMOTE the training data at 400% oversampling and 100% undersampling, leading to 13 sets of 4797 individuals with 1492 positive, 1865 negative |
Or | Multiply downsize the training data into 11 sets of 373 positive and 373 negative |
Then either | Grow a random forest, pick the top five variables, and apply SVM using those five variables |
Or | Proceed straight to SVM |
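The splitting and downsizing steps in the flow above can be illustrated with a short sketch. This is a minimal illustration in plain Python, not the authors' code. The function names `stratified_split` and `downsize`, the fixed seed, and the choice to make the negative subsets disjoint are all assumptions; the source states only the resulting counts.

```python
import random

def stratified_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))  # floor of the 70% share
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return train, test

def downsize(train_idx, labels, seed=0):
    """Pair all positives with disjoint, equally sized negative subsets.

    Assumption: subsets are disjoint, so leftover negatives are dropped.
    """
    rng = random.Random(seed)
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    rng.shuffle(neg)
    n_sets = len(neg) // len(pos)
    return [pos + neg[k * len(pos):(k + 1) * len(pos)] for k in range(n_sets)]

# HBV counts from the text: 172 positive, 8998 negative
labels = [1] * 172 + [0] * 8998
train_idx, test_idx = stratified_split(labels)
subsets = downsize(train_idx, labels)
print(len(subsets), len(subsets[0]))
```

Under these assumptions the HBV counts give 52 balanced subsets of 240 individuals each, which agrees with the figures stated in the flow. The same functions applied to the HCV counts (533 positive, 7287 negative) give 13 subsets.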
Analysis of variance
Results
Summary statistics
Variable | HBV positive (n = 172) | HBV negative (n = 8998) | p-value | HCV positive (n = 533) | HCV negative (n = 7287) | p-value |
---|---|---|---|---|---|---|
Sex | 34% female | 47% female | 0.0008 (a) | 36% female | 45% female | 0.0001 (a) |
Age mean (s.d.) | 40.5 (13.9) | 45.2 (18.7) | 0.0001 (b) | 40.6 (14.4) | 47.1 (19.2) | <0.0001 (b) |
HBV mean (95% CI) | SMOTE | SMOTE RF | Downsize | Downsize RF | MDS | MDS RF |
Fscore | 0.056 (0.054, 0.057) | 0.052 (0.050, 0.053) | 0.056 (0.054, 0.057) | 0.052 (0.050, 0.053) | 0.065 (0.061, 0.068) | 0.059 (0.055, 0.063) |
Precision | 0.034 (0.032, 0.036) | 0.026 (0.025, 0.027) | 0.029 (0.028, 0.030) | 0.027 (0.026, 0.028) | 0.034 (0.032, 0.036) | 0.031 (0.029, 0.032) |
Sensitivity | 0.625 (0.605, 0.645) | 0.611 (0.587, 0.634) | 0.625 (0.605, 0.645) | 0.611 (0.587, 0.634) | 0.246 (0.231, 0.260) | 0.675 (0.654, 0.680) |
HCV mean (95% CI) | SMOTE | SMOTE RF | Downsize | Downsize RF | MDS | MDS RF |
Fscore | 0.187 (0.179, 0.196) | 0.200 (0.196, 0.202) | 0.174 (0.170, 0.178) | 0.208 (0.200, 0.215) | 0.192 (0.190, 0.195) | 0.225 (0.220, 0.229) |
Precision | 0.134 (0.128, 0.140) | 0.117 (0.115, 0.119) | 0.103 (0.100, 0.105) | 0.124 (0.120, 0.129) | 0.115 (0.113, 0.117) | 0.138 (0.134, 0.141) |
Sensitivity | 0.311 (0.296, 0.326) | 0.668 (0.654, 0.682) | 0.567 (0.545, 0.590) | 0.625 (0.600, 0.650) | 0.589 (0.579, 0.598) | 0.610 (0.596, 0.623) |
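For readers who want to relate the three measures in the tables above, the following sketch assumes the F score is the usual F1, the harmonic mean of precision and sensitivity; the source does not state the formula. The function names and the confusion-matrix counts are hypothetical and chosen only for illustration.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def sensitivity(tp, fn):
    """Fraction of truly positive cases that are detected (recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_score(p, r):
    """Harmonic mean of precision and sensitivity (F1, an assumption)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts: 30 true positives, 70 false positives, 20 false negatives
p = precision(30, 70)
r = sensitivity(30, 20)
print(round(p, 3), round(r, 3), round(f_score(p, r), 3))

# Reported HCV means for SMOTE without random forest pre-selection
f_hcv_smote = f_score(0.134, 0.311)
print(round(f_hcv_smote, 3))  # 0.187
```

Applying the formula to the reported mean precision and sensitivity reproduces the reported HCV SMOTE F score of 0.187. Agreement will not be exact in every column, because the mean of per-set F scores is not the same as the F score of the mean precision and sensitivity.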
Effect of balancing method
Precision source | SS | df | MS | F | p |
Method | 0.0004 | 2 | 0.0002 | 13.088 | 0.000 (a) |
Pre-processing | 0.0013 | 1 | 0.0013 | 80.504 | 0.000 (a) |
Method.Pre-processing | 0.0004 | 2 | 0.0002 | 12.222 | 0.000 (a) |
Sensitivity Source | SS | df | MS | F | p |
Method | 1.6151 | 2 | 0.8075 | 159.98 | 0.000 (a) |
Pre-processing | 1.9877 | 1 | 1.9877 | 393.78 | 0.000 (a) |
Method.Pre-processing | 2.8062 | 2 | 1.4031 | 277.97 | 0.000 (a) |
F score Source | SS | df | MS | F | p |
Method | 0.0011 | 2 | 0.0006 | 10.838 | 0.000 (a) |
Pre-processing | 0.0025 | 1 | 0.0025 | 47.154 | 0.000 (a) |
Method.Pre-processing | 0.0003 | 2 | 0.0002 | 3.007 | 0.052 |
Precision source | SS | df | MS | F | p |
Method | 0.0025 | 2 | 0.0013 | 32.843 | 0.000 (a) |
Pre-processing | 0.0011 | 1 | 0.0011 | 28.713 | 0.000 (a) |
Method.Pre-processing | 0.0064 | 2 | 0.0032 | 84.402 | 0.000 (a) |
Sensitivity Source | SS | df | MS | F | p |
Method | 0.1940 | 2 | 0.0970 | 114.86 | 0.000 (a) |
Pre-processing | 0.4375 | 1 | 0.4375 | 518.21 | 0.000 (a) |
Method.Pre-processing | 0.4162 | 2 | 0.2081 | 246.46 | 0.000 (a) |
F score Source | SS | df | MS | F | p |
Method | 0.0041 | 2 | 0.0021 | 25.546 | 0.000 (a) |
Pre-processing | 0.0114 | 1 | 0.0114 | 141.771 | 0.000 (a) |
Method.Pre-processing | 0.0019 | 2 | 0.0010 | 11.844 | 0.000 (a) |
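The columns of the analysis of variance tables are linked by two standard identities: each mean square is the sum of squares divided by its degrees of freedom, and each F ratio is the effect mean square divided by the error mean square. The sketch below checks these on the first sensitivity table. The error (residual) row is not shown in the source, so the error mean square here is only a derived, implied quantity, and the function names are hypothetical.

```python
def mean_square(ss, df):
    """Mean square = sum of squares divided by degrees of freedom."""
    return ss / df

def implied_error_ms(ms, f):
    """Error mean square implied by F = MS_effect / MS_error."""
    return ms / f

# Rows from the first sensitivity table: (source, SS, df, MS, F)
rows = [
    ("Method", 1.6151, 2, 0.8075, 159.98),
    ("Pre-processing", 1.9877, 1, 1.9877, 393.78),
    ("Method.Pre-processing", 2.8062, 2, 1.4031, 277.97),
]

implied = []
for name, ss, df, ms, f in rows:
    implied.append(implied_error_ms(ms, f))
    print(name, round(mean_square(ss, df), 4), round(implied[-1], 5))
```

All three rows imply the same error mean square of about 0.00505, as expected when the effects in one table share a single residual term. The same check can be applied to any of the other tables.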