The correlation coefficient is a standard measure of association between two random variables and is widely used in epidemiology. As such, considerable attention has been given to its interpretation [1–3] as well as to the methods for correcting attenuation due to random measurement error [4, 5]. Strategies for correcting measurement error require knowledge about the reliability of the measurements [2] or the use of an alloyed gold standard [6] to estimate reliability coefficients. In many epidemiological studies, the reliability of the measurements is unknown, making it impossible to correct for attenuation.
Classical methods are based solely on collected data and ignore any prior knowledge of the association under investigation. The Bayesian approach is one alternative for estimating correlation coefficients, in which knowledge from previous studies is incorporated to improve estimation. The purpose of this paper is to illustrate the utility of the Bayesian approach, in particular its ability to summarize evidence across studies and to correct for measurement error. To illustrate this method, the correlation between maternal weight gain during pregnancy and infant birth weight will be examined.
Statistical Methods
The Bayesian approach holds that a prior state of knowledge offers relevant information for statistical analyses. To update beliefs about a quantity of interest, such as the correlation coefficient ρ, Bayes' Theorem is used to calculate its posterior probability. Bayes' Theorem [7] holds that the posterior probability of ρ is given by the following formula:
$$P(\rho \mid \text{data}) = \frac{P(\text{data} \mid \rho)\, P(\rho)}{P(\text{data})} \qquad (1.1)$$
The factor P(data|ρ) is the likelihood function evaluated at ρ, which carries the information in the data collected in the investigator's study. The factor P(ρ) is the prior probability, which depends upon information available before the study. The term 1/P(data) should be viewed as a factor that makes the total probability equal to 1 when adding over all possible values of ρ; that is, the denominator P(data) is the sum or integral of the numerator over all values of ρ. It is often referred to as the normalizing constant. Bayes' Theorem [8] can therefore be rewritten as:
Posterior Probability ∝ Likelihood × Prior Probability (1.2)
where ∝ means proportional to.
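The role of the normalizing constant can be seen numerically. The following is a minimal sketch, not the method used in this paper: it evaluates likelihood × prior on a grid of ρ values and divides by the sum. The observed values (`r_obs`, `n`), the flat prior, and the stand-in likelihood (a normal approximation on the tanh⁻¹ scale) are all assumptions chosen for illustration.

```python
import math

def grid_posterior(rho_grid, likelihood, prior):
    """Return posterior weights on a grid: likelihood x prior, normalized to sum to 1."""
    unnormalized = [likelihood(rho) * prior(rho) for rho in rho_grid]
    total = sum(unnormalized)  # discrete analogue of P(data)
    return [u / total for u in unnormalized]

# Hypothetical inputs, for illustration only.
r_obs, n = 0.3, 50

def likelihood(rho):
    # Assumed stand-in likelihood: atanh(r) ~ Normal(atanh(rho), 1/n).
    z, xi = math.atanh(r_obs), math.atanh(rho)
    return math.exp(-0.5 * n * (z - xi) ** 2)

def flat_prior(rho):
    # P(rho) proportional to 1: no prior information.
    return 1.0

# Grid over the open interval (-1, 1); the endpoints are excluded.
rho_grid = [i / 100 for i in range(-99, 100)]
posterior = grid_posterior(rho_grid, likelihood, flat_prior)
rho_mode = rho_grid[posterior.index(max(posterior))]
```

With a flat prior the posterior weights are simply the rescaled likelihood, so the grid point with the largest weight coincides with the observed value.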
We suppose that the two variables of interest, X and Y, follow a bivariate normal distribution with means μx and μy, variances σx² and σy², respectively, and correlation coefficient ρ(x,y) = ρ. We will use the following conventional notation to represent the sample mean, variance and covariance:
and
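Also, as a reminder, the sample correlation coefficient r is defined by:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
where x̄ and ȳ denote the sample means.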
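The sample correlation coefficient is the usual Pearson estimate. As a minimal sketch, it can be computed from centered sums of squares and cross-products; the function name `sample_correlation` is hypothetical, and the result does not depend on whether the variances are normalized by n or n − 1.

```python
import math

def sample_correlation(x, y):
    """Pearson sample correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered sums of squares and cross-products.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / math.sqrt(s_xx * s_yy)
```

For example, `sample_correlation([1, 2, 3], [1, 3, 2])` gives 0.5.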
Using standard reference priors for μx, μy, σx, and σy, and applying (1.2), a reasonable approximation to the posterior density [4] of ρ is given by
After making the substitution ρ = tanh ξ and r = tanh z, we find that ξ is approximately normal with mean z and variance 1/n. These results were derived in a series of complicated substitutions by Fisher [10] and are described in detail elsewhere [4].
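The normal approximation on the transformed scale makes posterior statements about ρ easy to compute. The sketch below assumes only the approximation stated above (ξ approximately normal with mean z = tanh⁻¹ r and variance 1/n). The helper names and the inputs r = 0.3, n = 50 are hypothetical.

```python
import math

def xi_posterior_params(r, n):
    """Mean and variance of the approximate normal posterior of xi = atanh(rho)."""
    return math.atanh(r), 1.0 / n

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_rho_positive(r, n):
    """Approximate posterior probability that rho > 0, i.e. that xi > 0."""
    mean, var = xi_posterior_params(r, n)
    return 1.0 - normal_cdf((0.0 - mean) / math.sqrt(var))

# Hypothetical inputs, for illustration only.
mean_xi, var_xi = xi_posterior_params(0.3, 50)
p_positive = prob_rho_positive(0.3, 50)
point_estimate = math.tanh(mean_xi)  # back-transform to the correlation scale
```

Because tanh is monotone, any probability statement about ξ carries over directly to ρ, and the variance 1/n does not depend on the observed r.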
One of the most important properties of the hyperbolic tangent transformation (ρ = tanh ξ and r = tanh z) is that it allows us to take full advantage of the conjugate properties of the normal distribution when combining correlation coefficients from different studies. As stated in (1.2), we need a prior and a likelihood function to find the posterior density, which will follow a normal distribution with mean:
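$$\mu_{\text{posterior}} = \sigma^2_{\text{posterior}} \times \left( n_{\text{prior}} \times \tanh^{-1} r_{\text{prior}} + n_{\text{likelihood}} \times \tanh^{-1} r_{\text{likelihood}} \right) \qquad (1.5)$$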
and variance:
$$\sigma^2_{\text{posterior}} = \frac{1}{n_{\text{prior}} + n_{\text{likelihood}}} \qquad (1.6)$$
In general, many different priors can be used in (1.4), but clearly the inference becomes easier if we choose a prior of the following form, for some constant c:
$$P(\rho) \propto (1 - \rho^2)^{c} \qquad (1.7)$$
The choice of c is an important one, since it will determine the weight the prior will have in estimation. If we do not have any information from previous studies, a common choice for c will be 0, that is, P(ρ) ∝ 1. There are some other choices for c, such as −3/2 (referred to as the multiple parameter Jeffreys' rule) [8]. A detailed description of this concept is beyond the scope of this paper and is discussed elsewhere [9, 10].
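To see how c shapes the prior in (1.7), the unnormalized density can simply be evaluated at a few values of ρ. This is an illustrative sketch only. The function name `prior_unnormalized` is hypothetical, and the positive value c = 2 is an arbitrary choice used to show the contrast with the choices discussed above.

```python
def prior_unnormalized(rho, c):
    """Unnormalized prior (1 - rho^2)^c on the open interval (-1, 1)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return (1.0 - rho ** 2) ** c

# c = 0 gives a flat prior: every rho receives the same weight.
flat = [prior_unnormalized(rho, 0) for rho in (-0.9, 0.0, 0.9)]

# c = -3/2 puts increasing weight on values near -1 and +1.
jeffreys_type = [prior_unnormalized(rho, -1.5) for rho in (0.0, 0.5, 0.9)]

# A positive c (arbitrary here) concentrates weight near rho = 0.
peaked = [prior_unnormalized(rho, 2) for rho in (0.0, 0.5, 0.9)]
```

The flat case returns 1.0 everywhere, the c = −3/2 values increase with |ρ|, and the c = 2 values decrease with |ρ|.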
Description of the Study Population
The New York State Angler Cohort Study was initiated in 1991 to characterize exposure to persistent toxic contaminants through the consumption of Lake Ontario sport fish in men and women of reproductive age. Potential relations between these exposures and various reproductive and developmental endpoints were also assessed. A description of the cohort has been published elsewhere [11]. Briefly, the New York State Angler Cohort Study employed a cross-sectional design to survey a stratified random sample of men and women between the ages of 18 and 40 who bought fishing licenses in 16 upstate New York counties in close proximity to Lake Ontario. Detailed information has been compiled for the children born to cohort members between 1986 and 1991 and includes data from birth certificates and maternal and newborn medical records. Of the 2430 women with singleton index births during the study time period, 2205 (91%) had both medical records and birth certificates available with no missing data relevant to the study question.
Among the index study group of children, the prevalences of low birth weight (<2500 grams) and pre-term delivery (<37 weeks) were 3.3 and 3.7 percent, respectively. The mean birth weight was 3503 grams and the mean gestational age was 39.7 weeks. The majority of women were white (98.8%) and were married at the time of delivery (92.6%). For the current study, we restricted our analysis to African American women (n = 26). In these women, the mean weight gain was 29.61 ± 10.86 pounds, and the mean infant birth weight was 3484 ± 462 grams.
Implementation of the Bayesian Methodology
The correlation between maternal weight gain during pregnancy and infant birth weight in African American women was estimated using data from the Angler Cohort Study. It is known that maternal weight gain in this study was measured with error. The sample correlation coefficient between maternal weight gain and infant birth weight was r_xy = 0.27 (n = 26). This estimate differs greatly from that of a previous study (r_xy = 0.63), in which the maternal weight gain measurements were performed more precisely and were based on a large sample (n = 1026) [12].
We combined data collected in the Angler Cohort Study with information from the prior study using formulas (1.5) and (1.6). Specifically, we had a normally distributed prior and likelihood, which are conjugate functions. The posterior distribution is then normally distributed, with the following variance:
$$\sigma^2_{\text{posterior}} = \frac{1}{n_{\text{prior}} + n_{\text{likelihood}}} = \frac{1}{1026 + 26} \approx 0.0009$$
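and mean
$$\mu_{\text{posterior}} = \sigma^2_{\text{posterior}} \times \left( n_{\text{prior}} \times \tanh^{-1} r_{\text{prior}} + n_{\text{likelihood}} \times \tanh^{-1} r_{\text{likelihood}} \right) = 0.0009 \times \left( 1026 \times \tanh^{-1} 0.63 + 26 \times \tanh^{-1} 0.266 \right) = 0.691$$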
That is, the posterior distribution of ξ is Normal(Mean = 0.691, Variance = 0.0009), resulting in a point estimate of the correlation coefficient of tanh(0.691) = 0.598.
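The arithmetic behind the posterior mean can be checked with a short script. This is a sketch under one assumption: the posterior variance is taken to be 1/(n_prior + n_likelihood), which follows from normal conjugacy when each transformed coefficient has variance 1/n and is consistent with the value 0.0009 used here. The function name `combine_on_z_scale` is hypothetical.

```python
import math

def combine_on_z_scale(r_prior, n_prior, r_lik, n_lik):
    """Combine two correlations on the atanh scale, weighting each by its sample size."""
    var_post = 1.0 / (n_prior + n_lik)  # assumed form of the posterior variance
    weighted_sum = n_prior * math.atanh(r_prior) + n_lik * math.atanh(r_lik)
    return var_post * weighted_sum, var_post

# Values taken from the text: prior study (r = 0.63, n = 1026), current data (r = 0.266, n = 26).
weighted_sum = 1026 * math.atanh(0.63) + 26 * math.atanh(0.266)

# Using the variance rounded to 0.0009, as in the text.
mean_rounded = 0.0009 * weighted_sum
rho_rounded = math.tanh(mean_rounded)

# Using the unrounded variance 1/1052.
mean_unrounded, var_unrounded = combine_on_z_scale(0.63, 1026, 0.266, 26)
rho_unrounded = math.tanh(mean_unrounded)
```

With the variance rounded to 0.0009 the script gives a mean of about 0.691 and a back-transformed estimate of about 0.599, in line with the figures in the text. Carrying the unrounded variance 1/1052 ≈ 0.00095 through gives a somewhat larger mean (about 0.73, or about 0.62 on the correlation scale), so the result is sensitive to how the variance is rounded.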
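Since we know the posterior distribution, we can also calculate the 95% posterior probability interval, which is defined by
$$\mu_{\text{posterior}} \pm 1.96 \times \sigma_{\text{posterior}} = 0.691 \pm 1.96 \times \sqrt{0.0009},$$
that is, (0.63–0.75).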
Using the hyperbolic tangent transformation, we obtained a corresponding interval for the posterior ρ of (0.56–0.63). If we based our conclusion only on the collected data, the 95 percent confidence interval on the transformed scale would be 0.27 ± 1.96 × (1/26)^(1/2), or (−0.11 – 0.65). This corresponds to a 95% confidence interval for the correlation coefficient of (−0.11 – 0.57) on the original scale.
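Both intervals can be reproduced in a few lines. The sketch below takes the posterior mean 0.691 and variance 0.0009 from the text as given, and for the data-only interval uses tanh⁻¹(0.266) with variance 1/26. The function name `interval_both_scales` is hypothetical.

```python
import math

def interval_both_scales(mean_z, var_z, z_crit=1.96):
    """95% normal interval on the atanh scale and its back-transform to the rho scale."""
    half_width = z_crit * math.sqrt(var_z)
    lower, upper = mean_z - half_width, mean_z + half_width
    return (lower, upper), (math.tanh(lower), math.tanh(upper))

# Posterior interval, using the mean and variance reported in the text.
post_z, post_rho = interval_both_scales(0.691, 0.0009)

# Data-only interval: current study alone (r = 0.266, n = 26).
data_z, data_rho = interval_both_scales(math.atanh(0.266), 1.0 / 26)
```

The posterior interval is roughly (0.63, 0.75) on the transformed scale and (0.56, 0.64) for ρ. The data-only interval is roughly (−0.11, 0.66) on the transformed scale and (−0.11, 0.58) for ρ. These agree with the intervals in the text to within rounding in the second decimal place, and the comparison shows how much the large prior study narrows the interval.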