Methods to compare
The following linear regression model is considered for the outcome $y_i$ of subject $i$ with a covariate $x_i$:

\[ y_i = \alpha + \beta x_i + \varepsilon_i, \tag{1} \]

where $\varepsilon_i$ is random noise and $i = 1, \ldots, n$. The error $\varepsilon$ is assumed to be uncorrelated with $x$ and to have a mean equal to zero and a constant variance. The parameters $\alpha$ and $\beta$ denote the intercept and the average change in $y$ per unit change in $x$. The slope and intercept of the regression line can be estimated by Ordinary Least Squares (OLS). However, in immunological data the $y$'s in equation (1) are only partly observed. A lower threshold or detection limit, DL, interferes with measurements of low levels as follows: a value $y_i \ge \mathrm{DL}$ is observed as measured, whereas a value $y_i < \mathrm{DL}$ is reported only as a non-detect (ND).
Since NDs of cytokine measurements reflect levels of exposure, they cannot be considered as missing at random (MAR) [22]. Therefore, deleting the lowest values is expected to produce biased results. Other types of methods to analyze these data are imputation and modelling of NDs. An overview of the available methods is given in Table 1. In environmental statistics, a method called robust regression on order statistics (ROS) exists [1, 9]. This method is often used to compute summary statistics.
To reflect uncertainty about imputation, we propose to employ a multiple imputation (MI) approach as introduced by Little and Rubin [22, 23]. Based on a truncated normal distribution, we first compute the mean and the standard deviation. This can be done using the functions cenmle or ros from the R package NADA [24]. Then, values for the NDs are generated randomly, $m$ complete data sets are created, and each data set is analyzed separately. Rubin (Chapter 3, [25]) gives the following rule for combining the results. With $m$ imputations, we obtain $m$ different point estimates $\hat{Q}_1, \ldots, \hat{Q}_m$ as well as standard errors $s_1, \ldots, s_m$. The pooled MI point estimate is then simply the average of the $m$ estimates:

\[ \bar{Q} = \frac{1}{m} \sum_{j=1}^{m} \hat{Q}_j . \]

The variance estimate associated with $\bar{Q}$ has two components. The within-imputation variance can be estimated by the average of the complete-data variances,

\[ \bar{U} = \frac{1}{m} \sum_{j=1}^{m} s_j^2 . \]

The between-imputation variance is the variance of the estimates,

\[ B = \frac{1}{m-1} \sum_{j=1}^{m} \left( \hat{Q}_j - \bar{Q} \right)^2 . \]

The total variance is defined by $T = \bar{U} + (1 + m^{-1}) B$, and inferences are based on the approximation $(Q - \bar{Q})\, T^{-1/2} \sim t_\nu$, where the degrees of freedom are given by

\[ \nu = (m - 1) \left[ 1 + \frac{\bar{U}}{(1 + m^{-1}) B} \right]^2 . \]
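The combining rule can be sketched in a few lines of Python (used here purely for illustration; the computations in this study were done in R). The function name pool_rubin is hypothetical, and the degrees-of-freedom expression is the standard one from Rubin's rules, taken as an assumption about the omitted formula.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool m complete-data estimates and standard errors (hypothetical helper)."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need m >= 2 estimates with matching standard errors")
    q_bar = mean(estimates)                       # pooled point estimate
    u_bar = mean([se ** 2 for se in std_errors])  # within-imputation variance
    b = variance(estimates)                       # between-imputation variance, divisor m - 1
    t = u_bar + (1 + 1 / m) * b                   # total variance
    # Degrees of freedom; taken as infinite when all imputations agree (b == 0).
    nu = math.inf if b == 0 else (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, t, nu


# Three imputed-data slopes, each with standard error 1.0 (made-up numbers).
q_bar, t, nu = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The square root of the returned total variance serves as the pooled standard error, which is then referred to a t-distribution with the returned degrees of freedom.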
Finally, two non-imputation methods for incorporating NDs into regression models are investigated. Without introducing uncertainty about the distribution of the NDs, the outcomes can be dichotomized and logistic regression can be applied. However, the relationship between the covariate and the outcome is then on a logit scale instead of a linear one. A more sophisticated approach is to use a maximum likelihood estimation (MLE) method for left-censored data, called the TOBIT model after the economist James Tobin [26]. The model is written as a combination of a probit part and an OLS part. The probit part determines whether the outcome variable is below the DL, and the OLS part is a truncated regression model. The TOBIT model estimates a regression model for the data above the DL, and assumes that the censored data (below the DL) have the same distribution of errors as the observed data. The weakness of this method is that it may be more vulnerable to violation of the assumptions about the error distribution. It is frequently noted in the literature that in the presence of heteroscedasticity the TOBIT estimates are inconsistent, and that there is only limited information about the direction of the bias [20, 21].
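To make the two parts of the TOBIT model concrete, the following Python sketch evaluates the standard log-likelihood of a left-censored normal regression. It is an illustration under stated assumptions, not necessarily the exact formulation used in this study: errors are taken as normal with standard deviation sigma, a non-detect is encoded as None, and the function names are hypothetical. No optimizer is included; in practice the function would be maximized over alpha, beta and sigma.

```python
import math


def normal_logpdf(z):
    """Log density of the standard normal distribution."""
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)


def normal_logcdf(z):
    """Log of the standard normal CDF; may underflow for extremely negative z."""
    return math.log(0.5 * math.erfc(-z / math.sqrt(2)))


def tobit_loglik(alpha, beta, sigma, x, y, dl):
    """Log-likelihood of y = alpha + beta * x + error, left-censored at dl.

    y[i] is None when subject i is a non-detect (assumed encoding).
    """
    total = 0.0
    for xi, yi in zip(x, y):
        mu = alpha + beta * xi
        if yi is None:
            # Censored part: probability that the outcome falls below the DL.
            total += normal_logcdf((dl - mu) / sigma)
        else:
            # Observed part: ordinary normal regression density.
            total += normal_logpdf((yi - mu) / sigma) - math.log(sigma)
    return total
```

Each non-detect contributes only the probability of falling below the DL, which is why the estimates rely on the assumed error distribution holding below the DL as well.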
Simulation study
We simulated data sets by drawing samples from a population similar to the example data in the Background section, and by allocating a proportion of observations as NDs.
For the covariate x (infection intensity) we used (1) a three-component normal mixture distribution, (2) a two-component normal mixture distribution, and (3) three classes. The three-component normal mixture distribution has means equal to 0.77, 3.35 and 4.59 and a within-component variance of 0.027. The proportions of the three components were 0.83, 0.13 and 0.04, respectively. The two-component normal mixture distribution has means equal to 0.77 and 3.69 and a within-component variance of 0.069, with their proportions 0.84 and 0.16, respectively.
Then, based on the characteristics of Cytokine 1, outcome variables were generated using the following regression model,

\[ y_i = 3.04 - 0.16\, x_i + \varepsilon_i, \]

for individual $i \in \{1, \ldots, n\}$. Based on Cytokine 2, we generated outcome variables as

\[ y_i = 0.66 + 0.27\, x_i + \varepsilon_i. \]

The errors $\varepsilon_i$ were assumed to be standard normally distributed.
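The data-generating step for the continuous covariate can be sketched as follows, in Python for illustration only (the study itself used R). The sketch draws x from the three-component normal mixture described above and generates y from the Cytokine 2 model with standard normal errors; the function names and the seed are hypothetical.

```python
import random


def draw_mixture(rng, means, weights, variance):
    """Draw one value from a normal mixture with a common within-component variance."""
    mu = rng.choices(means, weights=weights, k=1)[0]  # pick a component
    return rng.gauss(mu, variance ** 0.5)             # gauss takes a standard deviation


def simulate_continuous(n, intercept, slope, seed=0):
    """Simulate n (x, y) pairs with mixture-distributed x and N(0, 1) errors."""
    rng = random.Random(seed)
    means, weights, var = [0.77, 3.35, 4.59], [0.83, 0.13, 0.04], 0.027
    data = []
    for _ in range(n):
        x = draw_mixture(rng, means, weights, var)
        y = intercept + slope * x + rng.gauss(0.0, 1.0)
        data.append((x, y))
    return data


# Cytokine 2 model: y = 0.66 + 0.27 x + error
sample = simulate_continuous(n=200, intercept=0.66, slope=0.27)
```

The two-component mixture is obtained in the same way by changing the means, weights and variance inside simulate_continuous.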
Based on biology, the malaria parasite measurements lend themselves to being categorized in three classes: negative, submicroscopic, and microscopic. Instead of looking at the effect of malaria with continuous measurements, we considered the categorical malaria variable, say $z$. The dummy code $z_i = (z_{i1}, z_{i2}, z_{i3})^\top$ denotes a vector of malaria category indicators for the $i$th subject, with elements $z_{ij} = 1$ if the $i$th subject belongs to the $j$th category and $z_{ij} = 0$ otherwise. The categorical covariate vectors $z_i$ were then generated from the multinomial distribution of categorized malaria status with proportions of 0.69, 0.14, and 0.17. Based on Cytokine 1, $y$ was generated following the model

\[ y_i = 2.97 - 0.13\, z_{i2} - 0.58\, z_{i3} + \varepsilon_i, \]

while based on Cytokine 2

\[ y_i = 0.84 + 0.13\, z_{i2} + 0.77\, z_{i3} + \varepsilon_i. \]

Here the errors $\varepsilon_i$ were assumed to be standard normally distributed.
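A minimal Python sketch of the categorical design, for illustration only: each subject is assigned one of the three malaria classes with the stated proportions, the class is dummy coded, and the outcome follows the Cytokine 2 model. The function name, the seed and the use of class 1 as the reference category are assumptions of the sketch.

```python
import random


def simulate_categorical(n, coefs, seed=0):
    """coefs = (intercept, effect of class 2, effect of class 3); class 1 is the reference."""
    rng = random.Random(seed)
    classes, props = [1, 2, 3], [0.69, 0.14, 0.17]
    rows = []
    for _ in range(n):
        c = rng.choices(classes, weights=props, k=1)[0]
        z = tuple(int(c == j) for j in classes)  # dummy vector (z1, z2, z3)
        y = coefs[0] + coefs[1] * z[1] + coefs[2] * z[2] + rng.gauss(0.0, 1.0)
        rows.append((z, y))
    return rows


# Cytokine 2 model: y = 0.84 + 0.13 z2 + 0.77 z3 + error
rows = simulate_categorical(400, coefs=(0.84, 0.13, 0.77))
```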
We then considered data samples of size n = 200, 400 and 1,000. The proportions of NDs were set to 10%, 30%, 50% and 70%. The corresponding cut-off points of DL values were: (1) for imitation of Cytokine 1, 14.7, 10, 17, and 29 pg/ml, and (2) for mimicking Cytokine 2, 0.7, 1.6, 2.7, and 4.6 pg/ml.
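Allocating a proportion of observations as NDs can be sketched as below, in Python for illustration. Note the assumption: the sketch picks the DL as an empirical quantile of the simulated outcomes so that roughly the target fraction falls below it, whereas the simulation described here used the fixed cut-off points listed above. Function names and example values are hypothetical.

```python
def detection_limit(values, nd_fraction):
    """Empirical cut-off such that about nd_fraction of the values fall below it."""
    ordered = sorted(values)
    k = round(nd_fraction * len(ordered))
    k = min(max(k, 0), len(ordered) - 1)  # keep the index inside the list
    return ordered[k]


def censor(values, dl):
    """Replace values below dl by None (a non-detect); keep the others unchanged."""
    return [v if v >= dl else None for v in values]


# Ten made-up outcomes, censored so that about 30% become non-detects.
ys = [0.2, 1.5, 0.9, 2.4, 3.1, 0.4, 1.1, 2.0, 2.8, 0.7]
dl = detection_limit(ys, 0.30)
observed = censor(ys, dl)
```

With tied values the realized fraction of NDs can differ slightly from the target.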
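For studying the effect of heteroscedastic errors we used the same model as in (3) but now with a variance depending on the value of $x$, by using $\varepsilon_i \sim N(0, \sigma^2(x_i))$, where $\sigma^2(x_i)$ denotes a variance that is a function of $x_i$.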
Evaluation of methods
In general, the accuracy of an estimate can be evaluated by its bias, which represents the closeness to the true value, whereas precision measures the ability to reproduce an estimate (regardless of accuracy). The combination of both accuracy and precision of an estimate can be investigated by the root mean square error (RMSE) as follows:

\[ \mathrm{RMSE} = \sqrt{ \frac{1}{S} \sum_{s=1}^{S} \left( \hat{\beta}_s - \beta \right)^2 }, \]

where $\hat{\beta}_s$ is the estimate obtained from the $s$th of $S$ simulated data sets and $\beta$ is the true value.
Therefore, parameter estimates provided by the various methods were compared in terms of mean bias and RMSE. Coverage probability is also reported, which is the probability that the confidence interval of the estimate contains the true value. Additionally, the unbiased methods were compared for their hypothesis testing abilities in terms of power. The Wald-type statistic $\hat{\beta} / \mathrm{SE}(\hat{\beta})$ was used for testing. For a continuous outcome it approximately follows a $t$-distribution with $n - 2$ degrees of freedom, where $n$ is the number of observations in each sample.
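The evaluation criteria can be computed from the simulation replicates as in the following Python sketch (illustrative only; the function name is hypothetical). As a simplifying assumption, the confidence intervals use a fixed critical value of 1.96 rather than the quantile of the t-distribution with n - 2 degrees of freedom.

```python
import math


def evaluate(estimates, std_errors, true_value, crit=1.96):
    """Mean bias, RMSE and coverage of Wald-type intervals over simulation replicates."""
    s = len(estimates)
    bias = sum(e - true_value for e in estimates) / s
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / s)
    covered = sum(
        1
        for e, se in zip(estimates, std_errors)
        if e - crit * se <= true_value <= e + crit * se
    )
    return bias, rmse, covered / s


# Three made-up replicates for a true slope of 1.0.
bias, rmse, coverage = evaluate([0.9, 1.1, 1.6], [0.1, 0.1, 0.1], true_value=1.0)
```

Power can be estimated in the same way, as the fraction of replicates in which the absolute Wald statistic exceeds the critical value.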
All computations were done using the programming language R [27].