Statistical methods
The demographic characteristics of the women were summarized using descriptive statistics such as mean ± standard deviation (SD). The Kolmogorov-Smirnov test was used to assess the normality of the distribution of each biomarker’s results.
When the true disease status is unknown, the traditional statistical methods for assessing the diagnostic accuracy of a test are not valid, because these methods assume the existence of a GS test that has perfect sensitivity and specificity. In the past decades, several studies have proposed different statistical techniques as a general solution to the problem of not having a GS assessment. Of these, latent class modelling has been used extensively in medical science, specifically in test accuracy research. This modelling approach, which relates the observed results of a diagnostic test to the latent disease status, can provide valid estimates of accuracy measures in the absence of a perfectly accurate disease status classification [22].
In the current study, a Bayesian latent class model was applied to classify women into clinically meaningful subgroups. First, the model fitted to each of the biomarkers is described in the following paragraph, under the assumption that a GS (i.e., OGTT) is not available [28]:
Assume \( Y_i \) denotes the result of the experimental continuous biomarker for subject \( i \) (\( i = 1, 2, \dots, 523 \)). Let \( d_i \) be the latent variable that indicates the result of the unobserved gold standard reference test, that is, the disease status of the \( i \)th individual (1: presence, 0: absence). If the biomarker scores are normally distributed (possibly after a suitable transformation), then the latent class model without covariates can be defined as:
$$ \begin{array}{c} d_i \sim \mathrm{Bernoulli}\left(\pi_i\right) \\ \left(Y_i \mid d_i\right) \sim g_1\left(\cdot \mid \mu_D, \sigma_D^2\right)^{d_i} \, g_2\left(\cdot \mid \mu_{\overline{D}}, \sigma_{\overline{D}}^2\right)^{1-d_i} \end{array} $$
(1)
where \( \mu_D \) and \( \mu_{\overline{D}} \) are the means, and \( \sigma_D^2 \) and \( \sigma_{\overline{D}}^2 \) are the variances, of the normal models for the biomarker outcome in the diseased (\( D \)) and non-diseased (\( \overline{D} \)) populations, respectively. Also, \( g_1(\cdot) \) and \( g_2(\cdot) \) are the probability density functions of \( N\left(\mu_D, \sigma_D^2\right) \) and \( N\left(\mu_{\overline{D}}, \sigma_{\overline{D}}^2\right) \), respectively. In addition, \( \pi_i \) denotes the probability of disease, such that \( P(d_i = 1) = 1 - P(d_i = 0) = \pi_i \). In the absence of a GS, the model may lack identifiability. Hence, to achieve model identifiability, we assume that
\( \mu_D > \mu_{\overline{D}} \). Furthermore, to determine how close the distribution of \( Y_D \) is to the distribution of \( Y_{\overline{D}} \), we used the measure (Δ) proposed by Choi et al. When Δ is large (close to or greater than 0.5), the overlap between the diseased and non-diseased groups increases. Under this condition, the proposed method may not work well and convergence problems may occur [28, 31]. After obtaining the model parameter estimates, the ROC curve for cutoff values
\( c \in (-\infty, \infty) \) based on a single biomarker can be constructed by plotting \( \left(1-\mathrm{specificity}(c), \mathrm{sensitivity}(c)\right) = \left(1-\Phi\left(\frac{c-\mu_{\overline{D}}}{\sqrt{\sigma_{\overline{D}}^2}}\right), 1-\Phi\left(\frac{c-\mu_D}{\sqrt{\sigma_D^2}}\right)\right) \), where \( 1-\mathrm{specificity} \) and sensitivity are the false positive probability and the true positive probability, respectively, and \( \Phi \) is the cumulative distribution function of the standard normal distribution. Finally, the corresponding AUC, which is a measure of the overall performance of a diagnostic test, can be calculated as \( AUC = \Phi\left(-\frac{\mu_{\overline{D}}-\mu_D}{\sqrt{\sigma_{\overline{D}}^2+\sigma_D^2}}\right) \). This measure can take any value between 0 and 1; the closer the AUC is to 1, the better the ability to discriminate between subjects with and without the disease.
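The ROC and AUC expressions above can be evaluated directly once estimates of the four model parameters are available. The following is a minimal sketch in Python, not the authors' implementation (the analysis itself was carried out in R); the function names and the parameter values are hypothetical and serve only as an illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal; its cdf plays the role of Phi


def roc_point(c, mu_d, var_d, mu_nd, var_nd):
    """Return (1 - specificity, sensitivity) at cutoff c under the binormal model."""
    fpr = 1.0 - STD_NORMAL.cdf((c - mu_nd) / var_nd ** 0.5)
    tpr = 1.0 - STD_NORMAL.cdf((c - mu_d) / var_d ** 0.5)
    return fpr, tpr


def binormal_auc(mu_d, var_d, mu_nd, var_nd):
    """AUC = Phi((mu_D - mu_Dbar) / sqrt(var_D + var_Dbar))."""
    return STD_NORMAL.cdf((mu_d - mu_nd) / (var_d + var_nd) ** 0.5)


# Hypothetical parameter values (assumptions, not estimates from the study).
MU_D, VAR_D, MU_ND, VAR_ND = 2.0, 1.0, 0.0, 1.0

auc = binormal_auc(MU_D, VAR_D, MU_ND, VAR_ND)
# Trace the ROC curve over a grid of cutoffs from -4.0 to 6.0.
curve = [roc_point(c / 10.0, MU_D, VAR_D, MU_ND, VAR_ND) for c in range(-40, 61)]
```

With equal means the function returns 0.5, which matches the interpretation of the AUC as a measure of discrimination: a marker whose two distributions coincide carries no diagnostic information.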
To estimate \( \theta = \left(\pi, \mu_D, \mu_{\overline{D}}, \sigma_D^2, \sigma_{\overline{D}}^2\right) \), we employed a Bayesian approach. We assumed non-informative prior distributions for all of the parameters. Normal priors were selected for \( \mu_D \) and \( \mu_{\overline{D}} \), and gamma priors for \( 1/\sigma_D^2 \) and \( 1/\sigma_{\overline{D}}^2 \). For \( \pi \), a Dirichlet prior was chosen. The Markov chain Monte Carlo (MCMC) method was used to obtain the Bayesian parameter estimates from the posterior distribution. The mean, standard deviation, and 95% credible interval (CrI) were used as posterior summary measures. We also report the Monte Carlo (MC) error, which measures the computational accuracy of the posterior mean. The convergence of the MCMC technique can be assessed by various criteria as well as by autocorrelation diagnostic plots. If the autocorrelation within chains is not high, this may be satisfactory evidence for convergence.
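The posterior summaries described above can be illustrated on a vector of MCMC draws for a single parameter. The sketch below is an assumption-laden illustration rather than the OpenBUGS computation: the MC error is estimated by batch means (one common estimator), the credible interval is the equal-tailed interval from empirical quantiles, and the "chain" is a stand-in of independent draws from a hypothetical posterior.

```python
import math
import random
import statistics


def posterior_summary(draws, n_batches=20):
    """Posterior mean, SD, 95% equal-tailed CrI, and batch-means MC error."""
    n = len(draws)
    ordered = sorted(draws)
    lower = ordered[int(0.025 * (n - 1))]
    upper = ordered[int(math.ceil(0.975 * (n - 1)))]
    # Batch means: split the chain into consecutive batches and use the
    # spread of the batch averages to estimate the error of the overall mean.
    size = n // n_batches
    batch_means = [
        statistics.fmean(draws[b * size:(b + 1) * size]) for b in range(n_batches)
    ]
    mc_error = statistics.stdev(batch_means) / math.sqrt(n_batches)
    return {
        "mean": statistics.fmean(draws),
        "sd": statistics.stdev(draws),
        "cri": (lower, upper),
        "mc_error": mc_error,
    }


def autocorrelation(draws, lag):
    """Sample autocorrelation of the chain at a positive lag."""
    n = len(draws)
    mean = statistics.fmean(draws)
    denom = sum((x - mean) ** 2 for x in draws)
    numer = sum((draws[t] - mean) * (draws[t + lag] - mean) for t in range(n - lag))
    return numer / denom


# Stand-in for MCMC output: independent draws from a hypothetical posterior.
rng = random.Random(1)
chain = [rng.gauss(0.85, 0.03) for _ in range(4000)]
summary = posterior_summary(chain)
lag1 = autocorrelation(chain, 1)
```

For a well-mixing chain the lag autocorrelations decay quickly towards zero and the MC error is small relative to the posterior SD; a slowly decaying autocorrelation suggests that more iterations or thinning may be needed.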
Second, we combined the three markers into a single composite diagnostic test based on the model proposed by Yu et al. in 2011 [30]. We first considered the different pairwise linear combinations of biomarkers for diagnosis, and then examined the linear combination of all three biomarker results. At this stage, the covariate-adjusted ROC curve was used to evaluate the classification accuracy of the marker combinations.
Let \( Y_i = \left(Y_{i,1}, \dots, Y_{i,K}\right)^{\prime} \) denote the \( K \)-dimensional vector of multiple correlated tests, such that \( Y_{i,k} \) denotes the diagnostic result of the \( k \)th test (\( k = 1, \dots, K \)) when applied to subject \( i \) in a random sample of 523 subjects generated from normal distributions. Adjusting for covariates, Eq. 1 can be generalized on the latent true disease status as follows:
$$ {\displaystyle \begin{array}{cc}{d}_i\sim \mathrm{Bernoulli}\left({\pi}_i\right)& \left(i=1,2,\dots, 523\right)\\ {}\left(\left.{Y}_i\right|{d}_i,{x}_i\right)\sim \mathrm{MVN}\left(\mu \left({x}_i,{d}_i\right),{\Sigma}_{d_i}\right)& \end{array}} $$
(2)
where the probability of being diseased (\( \pi \)) follows a logistic model:
$$ P\left(d_i=1\right)=\pi_i=\frac{\exp\left(\alpha^{\prime}\mathrm{x}_i\right)}{1+\exp\left(\alpha^{\prime}\mathrm{x}_i\right)},\qquad \alpha^{\prime}\mathrm{x}_i=\alpha_0+\alpha_1\mathrm{x}_{i1}+\dots+\alpha_s\mathrm{x}_{is} $$
where \( \alpha = \left(\alpha_0, \alpha_1, \dots, \alpha_s\right)^{\prime} \) is the vector of coefficients and \( \mathrm{x}_i = \left(1, \mathrm{x}_{i1}, \dots, \mathrm{x}_{is}\right)^{\prime} \) is the covariate vector of individual \( i \). Because we found that maternal age and BMI may play an important role in helping to discern GDM status, these variables were used as disease and test covariates. The test scores follow a multivariate normal (MVN) distribution. The model for disease status and the three marker results for the GDM data is given by:
$$ \mathrm{logit}\left(P\left({d}_i=1\right)\right)={\alpha}_0+{\alpha}_1\kern0.2em \mathrm{maternal}\kern0.2em {\mathrm{age}}_i+{\alpha}_2\kern0.2em {\mathrm{BMI}}_i $$
$$ E\left({Y}_k\right)=\mu \left(\mathrm{x},d\right)={\beta}_0^k+{\beta}_1^k\kern0.2em \mathrm{maternal}\kern0.2em {\mathrm{age}}_i+{\beta}_2^k\kern0.2em {\mathrm{BMI}}_i+{\beta}_3^k\kern0.2em {\mathrm{disease}}_i+{\beta}_4^k\kern0.2em \left({\mathrm{disease}}_i\times \mathrm{maternal}\kern0.2em {\mathrm{age}}_i\right)+{\beta}_5^k\kern0.2em \left({\mathrm{disease}}_i\times {\mathrm{BMI}}_i\right), $$
where \( {\beta}^k=\left({\beta}_0^k,{\beta}_1^k,{\beta}_2^k,{\beta}_3^k,{\beta}_4^k,{\beta}_5^k\right) \) is the corresponding vector of regression coefficients.
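To make the two regression components concrete, the sketch below evaluates the logistic model for the disease probability and the mean model for a single marker at one covariate value. It is an illustration only: the coefficient values are hypothetical and were chosen simply so that the example runs; they are not estimates from the GDM data.

```python
import math


def disease_probability(alpha, age, bmi):
    """pi = expit(alpha0 + alpha1 * maternal age + alpha2 * BMI)."""
    eta = alpha[0] + alpha[1] * age + alpha[2] * bmi
    return 1.0 / (1.0 + math.exp(-eta))


def marker_mean(beta_k, age, bmi, disease):
    """mu(x, d) for one marker, including disease-by-covariate interactions."""
    b0, b1, b2, b3, b4, b5 = beta_k
    return (b0 + b1 * age + b2 * bmi + b3 * disease
            + b4 * disease * age + b5 * disease * bmi)


# Hypothetical coefficients (assumptions made for this example only).
ALPHA = (-6.0, 0.08, 0.10)
BETA_K = (1.0, 0.01, 0.02, 0.5, 0.005, 0.01)

pi_example = disease_probability(ALPHA, age=30, bmi=25)
# Covariate-specific separation between diseased and non-diseased means.
delta_k = marker_mean(BETA_K, 30, 25, 1) - marker_mean(BETA_K, 30, 25, 0)
```

Because of the interaction terms, the difference between the diseased and non-diseased means depends on maternal age and BMI, which is why the resulting ROC curve is covariate-adjusted.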
To generate the composite test, a linear combination of the biomarkers (\( Y_i^{\ast} = a^{\prime}Y_i \)) was employed. The optimal vector of the linear combination is calculated as \( a = \left(\Sigma_0 + \Sigma_1\right)^{-1}\Delta(\mathrm{x}) \), in which \( \Delta(\mathrm{x}) = \mu(\mathrm{x}, 1) - \mu(\mathrm{x}, 0) \). The combined AUC (cAUC) based on the combined test scores can be estimated as \( \Phi\left(\sqrt{a^{\prime}\Delta(\mathrm{x})}\right) \). In addition, the covariate-adjusted combined ROC (cROC) curve for a given cut-off value \( c \) is constructed by computing
$$ \left(1-c\mathrm{Specificity}\left(c|\mathrm{x}\right),c\mathrm{Sensitivity}\left(c|\mathrm{x}\right)\right)=\left(1-\Phi \left(\frac{c-{a}^{\prime}\mu \left(\mathrm{x},0\right)}{\sqrt{a^{\prime }{\Sigma}_0a}}\right),\Phi \left(\frac{a^{\prime}\mu \left(\mathrm{x},1\right)-c}{\sqrt{a^{\prime }{\Sigma}_1a}}\right)\right). $$
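The combination step can be sketched numerically as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' code: the helper names are hypothetical, the linear system is solved by a small Gauss-Jordan routine written for the example, and the two-marker means and covariance matrices are invented inputs at a single covariate value.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def quad_form(a, sigma):
    """a' Sigma a."""
    return dot(a, [dot(row, a) for row in sigma])


def optimal_combination(mu1, mu0, sigma1, sigma0):
    """Return a = (Sigma0 + Sigma1)^{-1} Delta and cAUC = Phi(sqrt(a' Delta))."""
    delta = [m1 - m0 for m1, m0 in zip(mu1, mu0)]
    pooled = [[s0 + s1 for s0, s1 in zip(r0, r1)] for r0, r1 in zip(sigma0, sigma1)]
    a = solve(pooled, delta)
    return a, PHI(math.sqrt(dot(a, delta)))


def croc_point(c, a, mu1, mu0, sigma1, sigma0):
    """Return (1 - cSpecificity, cSensitivity) at cutoff c for the combined score."""
    fpr = 1.0 - PHI((c - dot(a, mu0)) / math.sqrt(quad_form(a, sigma0)))
    tpr = PHI((dot(a, mu1) - c) / math.sqrt(quad_form(a, sigma1)))
    return fpr, tpr


# Hypothetical two-marker inputs at one covariate value x (assumptions).
MU1, MU0 = [1.0, 1.0], [0.0, 0.0]
SIGMA1 = SIGMA0 = [[1.0, 0.5], [0.5, 1.0]]
a_hat, cauc = optimal_combination(MU1, MU0, SIGMA1, SIGMA0)
fpr, tpr = croc_point(1.0 / 3.0, a_hat, MU1, MU0, SIGMA1, SIGMA0)
```

In this toy setting each marker alone has an AUC of \( \Phi(1/\sqrt{2}) \), and the combined score improves on it even though the two markers are positively correlated.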
We independently specified an \( \mathrm{MVN}\left(0,\mathrm{I}{\sigma}_{\alpha}^2\right) \) prior for \( \alpha \), in which I is the identity matrix, an \( \mathrm{MVN}\left(0,\mathrm{I}{\sigma}_k^2\right) \) prior for \( \beta^k \), and a Wishart(ν, Γ) prior for \( \Sigma_d \), where ν and Γ are the degrees of freedom and the scale matrix, respectively. To examine the convergence of the MCMC samples, autocorrelation plots and Geweke’s diagnostic test were used. Further, the optimal marker combination for making a diagnosis was identified as the one with the largest estimated AUC.
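The idea behind Geweke's diagnostic is to compare the mean of an early segment of the chain with the mean of a late segment; a large standardized difference suggests that the chain has not yet reached its stationary distribution. The sketch below is a deliberately simplified version and is not the coda implementation: it assumes the conventional first 10% and last 50% segments, and it estimates the variance of each segment mean with the naive sample variance divided by the segment length, which ignores autocorrelation (the standard diagnostic uses spectral density estimates instead). The two chains are simulated stand-ins.

```python
import math
import random
import statistics


def geweke_z(draws, first=0.1, last=0.5):
    """Simplified Geweke z-score comparing early and late chain segments."""
    n = len(draws)
    head = draws[: int(first * n)]
    tail = draws[int((1.0 - last) * n):]
    # Simplification (assumption): naive variance of each segment mean,
    # valid only when draws within a segment are roughly uncorrelated.
    var_head = statistics.variance(head) / len(head)
    var_tail = statistics.variance(tail) / len(tail)
    diff = statistics.fmean(head) - statistics.fmean(tail)
    return diff / math.sqrt(var_head + var_tail)


# Simulated stand-ins for MCMC output (hypothetical data).
rng = random.Random(7)
stationary = [rng.gauss(0.0, 1.0) for _ in range(5000)]
# Add a linear trend to mimic a chain that is still drifting.
drifting = [x + 3.0 * t / 5000 for t, x in enumerate(stationary)]

z_stationary = geweke_z(stationary)
z_drifting = geweke_z(drifting)
```

A chain that is still drifting produces a z-score far from zero, whereas a stationary chain typically gives a value within the usual standard normal range.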
For the Bayesian data analysis, the R2OpenBUGS package in R was used (https://cran.r-project.org/web/packages/R2OpenBUGS). Likewise, for the Geweke diagnostic, we used the coda library in R (http://www-fis.iarc.fr/coda). After obtaining the parameter estimates, differences in maternal age and BMI between GDM groups were evaluated using the Mann-Whitney U test. P values less than 0.05 were considered statistically significant. The R statistical programming software, version 3.5.1, was used for the univariate analyses (http://www.rproject.org).