Introduction

DNA methylation, tightly associated with alterations in the nucleosome DNA scaffold and coordinated chromatin alterations, is partially responsible for coordination of gene expression in individual cells [1–3]. Consequently, in the last decade there has been an interest in studying associations between DNA methylation and a wide variety of phenotypes. Because blood specimens are typically much more convenient to obtain from human subjects than other tissues, the bulk of published studies have used whole blood (sometimes referred to as peripheral blood). A wide variety of phenotypes and health conditions have been studied: aging [4–8], cancer [9–12], obesity [5, 13], cardiovascular disease [14], prenatal exposures/perinatal outcomes [15, 16], environmental exposures [17–24] (tobacco in particular [25, 26]), inflammatory diseases [27, 28•], psychiatric conditions [21, 29–32], and fertility [33]. While many of these studies have used candidate gene approaches with bisulfite-pyrosequencing, an increasing number have conducted epigenome-wide association studies (EWAS) using commercially available microarrays such as the Infinium HumanMethylation450 BeadChip assay (“450K,” produced by Illumina, Inc.), its predecessor, the Infinium HumanMethylation27 BeadChip (“27K”), or an older Illumina assay based on the company’s GoldenGate platform.

Of recent concern has been the extent to which variation in DNA methylation is driven by cell composition effects rather than truly intranuclear processes. Normal tissue development, individual cellular differentiation, and cellular lineage determination are regulated by epigenetic mechanisms [2]. This necessarily means that DNA methylation shows substantial variation across tissue types [34, 35] as well as individual cell types, demonstrated particularly clearly among the distinct types of leukocytes [1, 36, 37, 38••, 39•, 40, 41]. Because variation in DNA methylation measured in the blood will necessarily reflect variation in constituent leukocytes, there is a concern that phenotype associations with cell composition will confound (or at least mediate) associations between DNA methylation and phenotype. An additional consideration is that endogenous and exogenous cellular stress can induce inflammatory signaling (arising, for example, in the endoplasmic reticulum of non-immune cells [42]); hence, changes in DNA methylation of stromal or non-immune specific cells affected by the malady of interest will almost certainly represent an important component of the immune response to the perturbed state of interest.

In this article, we review many of the recent studies that have examined associations with DNA methylation in the blood, highlighting cell composition effects through evidence of involvement in the immune or inflammatory responses and effects for which mediation by cell type can reasonably be ruled out. We then briefly discuss methods for mitigating the potential for confounding and potentially assessing mediation by cell composition.

Twin Studies

Several twin studies have examined DNA methylation in the blood. Kaminsky et al. (2009) applied a 12K CpG island microarray to assay whole blood DNA enriched for unmethylated cytosines using the methylation sensitive restriction enzyme HpaII [43]. The study compared 19 monozygotic (MZ) twin pairs and 20 dizygotic (DZ) twin pairs, matched for age, sex, and blood cell count (total, neutrophil fraction, and lymphocyte fraction) and found small but marginally significant differences in intraclass correlation coefficient (ICC) between the matched MZ and DZ twin pairs, suggesting that DNA methylation was slightly more concordant in MZ pairs than in DZ pairs, and thus the existence of very weak but possibly important genetic effects unexplained by proportions of major blood cell types. Boks et al. (2009) used the GoldenGate platform to analyze peripheral blood from 23 MZ twin pairs, 23 DZ twin pairs, and 96 controls matched for age and gender [6]. The study found age-related DNA methylation at loci suggestive of potential cell composition effects (due to their involvement in immune or inflammatory processes): IL6, CARD15, PDGFRA, and NFKB1. However, the study also found other loci potentially independent of cell composition effects: ACVR1 and ELK. Li et al. (2013) used the 27K array to analyze peripheral blood in 22 MZ twin pairs [44], identifying 92 CpGs that significantly varied between twins within a pair and speculating that the differences were driven principally by immune function.

Associations With Genetic Variants and mRNA Expression

One detailed study used blood to investigate epigenetic associations with single nucleotide polymorphisms (SNPs) and with mRNA expression. Van Eijk et al. (2012) used the 27K platform to study whole blood from 72 male adults and 76 female adults, with a mean age of 52, focusing in particular on associations with SNPs and gene expression [45]. The authors used structural equation models to determine causal directionality, finding that the most common three-way association was the traditional model wherein genetic variants regulate DNA methylation, which in turn regulates expression. They also found that expression modules, defined as “clusters of expression probes,” differed substantially from DNA methylation modules [45]. The authors also used Gene Ontology (GO), an informative tool that describes gene products with associated biological functions and cellular processes, in order to analyze the multiple modules in various GO terms. Interestingly, expression modules were enriched for GO terms suggestive of immune processes but numerous others as well (e.g., those involving transcription and translation). In contrast, compared with expression modules, fewer methylation modules showed enrichment for GO terms, with 5 of 12 enriched GO terms suggestive of immune response.

Aging

Several studies have investigated epigenetic associations with aging. In a candidate gene study, Alexeeff et al. (2013) pyrosequenced selected genomic targets in whole blood from 789 elderly participants of the Normative Aging Study [46]. Significant associations were found for loci mapped to IFNG, IL6, TLR2, and iNOS (NOS2), suggestive of immune/inflammatory processes. The authors also found associations with Alu and LINE1 repetitive elements. Similarly, Madrigano et al. (2012) pyrosequenced selected genomic targets in whole blood from 784 elderly participants of the Normative Aging Study [7], finding strongly significant age associations with loci mapped to ICAM, IL6, IFNG, iNOS (NOS2), and TLR2, suggestive of immune/inflammatory processes, but also found associations with loci mapped to genes that were not reflective of immune response: CROT, F3, GCR (NR3C1), and OGG. Note that F3 is involved in clotting and hemostasis and, thus, potentially reflective of signaling processes within the blood. Using the 450K array, Harris et al. (2012) examined mononuclear cells (MCs) from 55 children with Crohn’s disease [27], finding a single differentially methylated locus mapped to TEPP. The authors report that TEPP (testis, prostate, and placenta-expressed protein) is poorly expressed in whole blood, though they interpret the small but significant differences as questionable and potentially explained by differences in immune subsets. Alisch et al. (2012) applied the 27K assay to peripheral blood from 398 boys aged 3–17 years, confirming results via 450K in a second pediatric population (78 participants aged 1–16 years) as well as a set of 1158 adult subjects [4]. The study reported enrichment for GO terms reflective of both developmental processes and immune function. Finally, Almen et al. (2014) examined age and obesity interactions using the 27K array to assay the blood from 46 adults [5]. Because interaction drove the authors’ filtering criteria, the loci demonstrating differential methylation were mapped principally to genes involved in metabolic function.

Cancer

Cancer biomarker studies have been well represented among investigations of DNA methylation in the blood. In a candidate gene study, Cassinotti et al. (2012) analyzed whole blood from 30 colorectal cancer (CRC) patients, 30 patients with adenomatous polyps, and 30 controls [9]. Samples were digested with the restriction enzyme Hin6I, and PCR was used to measure DNA methylation at loci mapped to 56 genes. Of six gene promoters selected as members of a biomarker panel for differentiating CRC patients from controls, one (PAX5) was reflective of immune response, while the others reflected cell cycle, tumor suppressor, or oncogene activity [CYCD2 (CCND2), HIC1, RASSF1A, RB1, and SRBC (PRKCDBP)]. Of three genes selected for a biomarker panel to differentiate controls from patients with adenomatous polyps (HIC1, MDG1, RASSF1A), none were strongly suggestive of immune response. Flanagan et al. (2009) used a custom tiling array covering 17 candidate genes to analyze peripheral blood from 14 bilateral breast cancer patients and 14 matched controls, validating results via pyrosequencing in 190 cases and 190 controls [10]. The authors found that methylation differences were driven primarily by intragenic repetitive elements, one of which was associated strongly with lower ATM mRNA levels, thus reflecting a signal independent of cell composition effects. Using 27K, Teschendorff et al. (2009) analyzed peripheral blood from 148 healthy individuals and 113 age-matched pre-treatment ovarian cancer cases, developing a biomarker to distinguish cases from controls [12]. Results were highly suggestive of immune response (which the authors link to aging processes), but also developmental pathways potentially independent of cell composition effects. Similarly, using 27K, Marsit et al. (2011) analyzed the blood from 112 bladder cancer patients and 118 controls, finding that differentially methylated loci were enriched for pathways suggestive of immune response or developmental pathways [11].

Fertility

The epigenetics of infertility has also been studied using blood. Friemel et al. (2014) applied the 450K array to peripheral blood from 30 infertile males aged 27–42 years (median age 35.5) and 10 fertile males aged 21–52 years old (median age 39.5) [33]. They found differential methylation for PIWIL1 and PIWIL2, both genes important in spermatogenesis and the former of which may regulate hematopoietic stem cells. Significant loci were enriched for the MHC class II GO term and HLA genes, reflective of immune activity.

Pregnancy and Birth

Many groups have used blood to study epigenetic processes involved in various outcomes related to pregnancy, birth, and early childhood. Martino et al. (2011) conducted a longitudinal study using MC samples from seven females, collecting cord blood at birth and following up with blood samples through 5 years [47]; all samples were arrayed via the 27K array. Loci showing significant longitudinal change were enriched for cell surface receptor and signal transduction terms reflective of changes in immune response. While the authors applied FACS to MCs to determine fractions of major cell types (CD4+ T cells, CD8+ T cells, B cells, and monocytes), analyzed the resultant cell fractions for longitudinal changes, and demonstrated change over time, the authors did not include the fractions as potential confounders in analysis of DNA methylation. Relton et al. (2012) applied the GoldenGate assay to blood from 178 birth cohort subjects to investigate associations with body composition measures at 9 years of age [48]. The most robust association was found for CpGs mapped to ALPL, which is principally involved with bone density/skeletal growth. Other genes to which significant CpGs were mapped include CASP10, CDKN1C, EPHA1, HLA-DOB, IRF5, MMP9, MPL, and NID1, two of which implicate immune function (HLA-DOB and IRF5) and one of which is involved in hemostasis (MPL). Liu et al. (2014) studied 308 African-American mother-infant pairs, assaying cord blood via 27K [49]. Loci showing significant associations with maternal pre-conception BMI were enriched for several infection and inflammation pathways, but also tumorigenesis and apoptosis pathways. Morales et al. (2014) applied the GoldenGate assay to cord blood samples from 258 birth cohort subjects, studying associations with pre-pregnancy BMI and gestational weight gain [50]. Among the 44 loci most significantly associated with weight gain were several immune and inflammation related genes (IL16, IL1B, IL8, NFKB1). NFKB1 in particular was selected as one of four genes the authors report as functionally relevant (others being MMP7, KCNK4, and TRPM5). Non et al. (2014) assayed cord blood via 450K, comparing 13 mothers with non-medicated depression or anxiety, 22 mothers taking SSRIs, and 23 controls [30]. GO terms enriched for loci differing between controls and non-medicated mothers with depression or anxiety consisted mostly of terms related to transcription and translation of DNA, although methylation differences were small; the authors found no loci associated with SSRI use. White et al. (2013) assayed blood via 27K to compare 14 pregnant women having preeclampsia to 14 normotensive controls [15]. None of the 19 top hits were suggestive of immune or inflammatory processes, although none were significant after adjusting for multiple comparisons. Sanders et al. (2013) studied cadmium exposure among 17 mothers with blood assayed using the methylated CpG island recovery assay (MIRA) [22]. Enriched GO terms were not only reflective of cell cycle and cancer pathways but also of immune response.

Environmental Exposures

In addition to the cadmium study mentioned above, numerous publications report epigenetic associations in the blood with environmental exposures. Bind et al. (2012) studied 704 elderly male subjects (mean age 73.2) by pyrosequencing candidate genes in the blood and examining associations with traffic-related pollutants [18]. They found significant interactions of DNA methylation and air pollution on C-reactive protein and fibrinogen for loci mapped to TLR2 (suggestive of immune effects) as well as for loci mapped to F3 (related to hemostasis) and for Alu and LINE-1 repeats. Madrigano et al. (2012) pyrosequenced 1377 blood samples for loci mapped to iNOS (NOS2) and GCR/NR3C1 [21]. Acknowledging a potential cell composition effect, the authors found associations with black carbon and PM2.5 for NOS2 (reflective of immune effects) but not for GCR. Kile et al. (2013) pyrosequenced NOS2 and repetitive elements in the blood from 38 welders, finding PM2.5 associations with NOS2 but not with repeats [19]. Alegria-Torres et al. (2013) pyrosequenced several genes in the blood from 39 male brick manufacturers, finding associations with polycyclic aromatic hydrocarbons (measured in urine) and DNA methylation of loci mapped to IL12, TNFA, p53, and Alu repeats [17]; note that IL12 and TNFA reflect immune response. Tarantini et al. (2013) pyrosequenced several targets in the blood from 63 steel workers, studying association between DNA methylation and particulate matter assumed to be rich in metals [23]. They found associations not only between PM10 and DNA methylation at loci mapped to NOS3 (reflective of immune response) but also between zinc exposure and methylation at loci mapped to EDN1 (not reflective of immune response).

Several studies have, in particular, investigated associations with exposures to tobacco smoke. Using the 27K array, Breitling et al. (2011) studied associations between DNA methylation and smoking in the blood from 177 subjects [25]. Associations were found for loci mapped to F2RL3 (reflective of hemostasis) and replicated in 316 independent samples. Via 450K, Zeilinger et al. (2013) studied smoking associations using the blood from 1793 subjects [26]. They found associations at loci mapped to AHRR (related to detoxification and not to immune response) and replicated the association in 479 independent samples. Somewhat related to smoking, Qiu et al. (2012) studied chronic obstructive pulmonary disease (COPD), using 27K to study the blood from two family-based cohorts (n = 1085, n = 369), investigating associations between DNA methylation and COPD [51]. They report associations for loci mapped to SERPINA1 (related to hemostasis) and FUT7. Neither gene is reflective of immune response or inflammation, although they report that 349 significant loci were enriched for GO terms reflective of these processes, as well as others (wound healing and coagulation cascades as well as response to stress and external stimuli).

Psychiatry

Finally, several studies have investigated associations between DNA methylation and psychiatric conditions. Using the 27K array, Nishioka et al. (2012) compared 18 schizophrenics with 15 controls [29], finding 603 differentially methylated CpG sites, many of these mapped to genes critical in neuronal differentiation and related to other psychiatric disorders, as well as genes functionally related to those previously found to be differentially methylated in schizophrenic patients. Enriched GO terms emphasized transcription factor binding and nucleotide binding, but neither immune response nor inflammation. In a candidate gene study employing pyrosequencing, Rusiecki et al. (2013) compared 75 post-traumatic stress disorder (PTSD) cases with 75 controls [31]; they found differences in loci mapped to IL18 (reflective of immune response) and to H19. In another candidate gene study pyrosequencing 82 candidate genes, Zhang et al. (2013) studied childhood adversity in alcohol-dependent patients, stratified by race (African-American vs. European American), for a total of 518 cases and 369 controls [32]. Significant loci were mapped to metabolic genes or those related to neurotransmission, but the custom panel was heavily biased towards such genes.

Overview of Studies That Assay Blood Without Adjusting for Cell Composition

Most of the studies reviewed above found associations near genes that were reflective of immune response or inflammation (Table 1). In addition, many studies found associations near genes that were involved in hemostasis/coagulation, which are coordinated and closely regulated by immunologically active cells and related cytokines. Most of the studies that report no associations with immune or inflammation processes used panels that were heavily biased towards other biological processes (e.g., the GoldenGate Cancer Panel measures methylation across CpG sites in promoter regions of genes with known implications in cancer, and thus, it may not provide an unbiased assessment of the epigenome). The evident associations near genes reflective of immune response or inflammation suggest the potential for phenotype associations with leukocyte cell composition. Consequently, it stands to reason that the associations reported in many of the studies reviewed above may be confounded by cell composition effects as well as, potentially, by localized expanded numbers of cells with activated immune signaling pathways (Fig. 1). This point has been made recently by several authors [52••, 53]; in particular, Jaffe and Irizarry (2013) suggest that many of the observed associations between DNA methylation and age may be substantially confounded by cell composition effects [52••] (Fig. 1a). On the other hand, a number of the studies found associations near genes that were not obviously related to immune function or inflammation, so there remains a great potential for epigenetic signals that arise independently of cell composition (Fig. 1c).

Table 1 Overview of various studies’ findings related to DNA methylation associations that may or may not indicate cell composition effects
Fig. 1

Directed acyclic graphs (DAGs) of various scenarios in which cell composition confounds or mediates the methylation associations, is not involved in the pathway, or is involved in reverse causality. a Cell composition acts as a confounder of the association between DNA methylation and outcome of interest (i.e., certain disease). Varying amounts of different cell types may influence the differential methylation patterns expressed and also have direct impact (i.e., immune activation or inflammatory responses) related to disease. In most of these studies with implications for immune-related associations, cell composition effects have not been properly adjusted for and thus may bias the methylation-associated results. Dotted green lines indicate the need to adjust for cell composition effects. b Cell composition effects are involved as mediators in the pathway associated with disease. Cell composition can either influence or be influenced by DNA methylation in its association with the outcome. c The association of DNA methylation patterns with disease may not be influenced by cell composition effects, but rather by other biological mechanisms. d Diseased states may potentially influence DNA methylation patterns and/or cell compositions in the blood, indicating a potential limitation of studies due to reverse causality

Strategies for Avoiding Confounding by Cell Composition

Multiple strategies have emerged for avoiding confounding by cell composition. The most direct method is to fractionate leukocytes and either to study a single cell type or, alternatively, to statistically adjust for directly measured cell counts or proportions. For example, Lam et al. (2013) argue that it is critical to account for granulocyte proportion in EWAS, either by removing granulocytes and arraying MCs or, minimally, by adjusting for them statistically [54]. Nestor et al. (2014) compared eight seasonal allergic rhinitis (SAR) patients with eight controls, arraying sorted CD4+ T cells using the 450K platform. The results showed clear separation in methylation profiles and, along with gene expression profiles, emphasized interleukin genes related to lymphocyte activation. However, we note that Th1 and Th2 cells differ in production of IFNG, IL2, IL4, IL5, IL6, and IL10, and the two cell types are differentially methylated in the promoter of IFNG [41]. In general, lineage-specific DNA methylation regulates differentiation of T cell subsets, with DMRs present in cell type-specific genes (FOXP3, IL2RA, CTLA4, CD40LG, IFNG) [40]. Thus, even in isolated cell types, DNA methylation associations may be confounded by subtle cell composition effects, including those related to cellular memory and activation state.
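To make the idea of statistical adjustment concrete, the following is a minimal sketch, not a method taken from any of the studies cited here. It assumes a hypothetical data set in which methylation at one CpG depends only on granulocyte fraction, while cases happen to have more granulocytes than controls. All names (ols, solve_linear_system, subjects) and all numbers are illustrative assumptions. An unadjusted regression attributes a methylation difference to case status; including the measured cell fraction as a covariate removes it.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ols(design, response):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * y for row, y in zip(design, response))
           for j in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical subjects: (case status, measured granulocyte fraction).
# Cases have, on average, a granulocyte fraction 0.10 higher than controls.
subjects = [(0, 0.50), (0, 0.55), (0, 0.60), (0, 0.65),
            (1, 0.60), (1, 0.65), (1, 0.70), (1, 0.75)]

# Assumed truth: methylation depends only on cell composition, not on disease.
methylation = [0.20 + 0.60 * g for _, g in subjects]

# Model 1: methylation ~ intercept + case
unadjusted = ols([[1.0, case] for case, _ in subjects], methylation)
# Model 2: methylation ~ intercept + case + granulocyte fraction
adjusted = ols([[1.0, case, g] for case, g in subjects], methylation)

print("case effect, unadjusted:", round(unadjusted[1], 4))
print("case effect, adjusted:  ", round(adjusted[1], 4))
```

In this noise-free toy example the apparent case effect is entirely a cell composition effect. Real data include measurement error in both the methylation values and the cell fractions, so adjustment will generally be imperfect.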

In general, cell sorting is difficult and error-prone. Though conventional complete blood cell (CBC) count methods are routine and inexpensive, they can only differentiate major leukocyte types (lymphocytes, monocytes, and neutrophils) in freshly isolated blood (Fig. 2a). More subtle distinctions, such as characterization of differences among Tregs, NK cells, or dendritic cells, can be made using FACS analysis on whole blood or blood cell fractions (Fig. 2b). However, FACS analyses require extensive logistical support and fresh blood samples and are costly; hence, FACS is infrequently used in large clinical or epidemiological studies. At the same time, activated immune cells will exist in many disease states; for example, activated NK cells, dendritic cells, monocytes, macrophages, etc. are the hallmarks of inflammatory conditions. While their numbers are likely to be small in the periphery, they will have very different DNA methylation signatures at particular loci. Consequently, if these cells are numerous enough, these differences will be detected as very small differences in the beta coefficients (mean methylation values).
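The arithmetic behind that last point can be sketched in a few lines. The example assumes a hypothetical locus that is 80% methylated in resting leukocytes and 20% methylated in an activated subtype, and a case sample in which 2% of cells are activated; the function name bulk_beta and all values are illustrative assumptions rather than measurements from any study.

```python
def bulk_beta(fractions, cell_betas):
    """Bulk methylation at one CpG as the fraction-weighted mean of cell-type betas."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("cell fractions must sum to 1")
    return sum(f * b for f, b in zip(fractions, cell_betas))


# Hypothetical locus: strongly different between resting and activated cells.
resting_beta, activated_beta = 0.80, 0.20

# Controls: no activated cells. Cases: 2% of leukocytes are activated.
control = bulk_beta([1.00, 0.00], [resting_beta, activated_beta])
case = bulk_beta([0.98, 0.02], [resting_beta, activated_beta])

difference = case - control
print(round(difference, 4))
```

Under these assumptions, a 60-percentage-point difference between cell types shrinks to a shift of roughly one percentage point in the bulk measurement, which is the scale of effect that many blood-based studies report.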

Fig. 2

Various strategies to control possible confounding by cell composition. a Complete blood count tests can be performed to determine the levels of white blood cells, red blood cells, and platelets. b Flow cytometry staining and fluorescence-activated cell sorter (FACS) may be performed to isolate specific immune cell types. Major limitations of these methods are that they are labor intensive and often costly. c Previously established statistical algorithms can be applied to control for cell composition effects (c adapted with permission from: Koestler et al.: Blood-based profiles of DNA methylation predict the underlying distribution of cell types: a validation analysis. Epigenetics 8(8):816–26 (2013). doi: 10.4161/epi.25430) [55]. i By projecting the methylation values from an experimental data set to a reference library of DNA methylation signatures for major immune cell types (i.e., B cells, T cells, granulocytes, monocytes, and NK cells), the estimates of specific cell(s) proportions in the blood can be determined. ii The methylation signatures for experimental samples are the weighted sum of the methylation signatures from distinct white blood cell types, where the weights are proportional to the specific cell-type frequencies in the blood. Illustration of the blood cell mixture deconvolution approach reproduced (and slightly altered) from [55]. The deconvolution approach involves (i) constrained projection of DNA methylation profiles from a target methylation data set (S1) onto a reference data set (S0), which is composed of the DNA methylation signatures for isolated white blood cell types (shapes reflect different white blood cell types). The result is an estimate of the underlying distribution of cell proportions (circle, triangle, and hexagon) for each sample within S1

One increasingly popular method of addressing cell composition effects is to adjust for them statistically. Liu et al. (2013) demonstrated that DNA methylation associations with rheumatoid arthritis are explained principally by cell composition effects [28•]. In this study, they imputed the proportions of major cell types using a deconvolution algorithm whose use is becoming increasingly popular [38••]; this algorithm deploys information gained from a reference panel of sorted cell types, which in principle could be expanded to include profiles for rare leukocyte types, although the algorithm may be limited by the sensitivity of the assay used to measure DNA methylation (Fig. 2c). Another alternative is a new reference-free deconvolution method [56]; although its use seems promising, as evidenced by a recent study that used it to find biomarkers for Wilms tumors [56], its robustness remains to be confirmed.
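The core idea of reference-based deconvolution, a constrained projection of a mixed sample onto reference signatures, can be illustrated with a small sketch. This is not the algorithm of [38••]; it is a simplified stand-in that solves a non-negative least squares problem by projected gradient descent, and the reference values, cell type labels, and function name estimate_cell_proportions are hypothetical.

```python
def estimate_cell_proportions(reference, sample, n_iter=2000):
    """Estimate non-negative mixing weights w such that reference @ w ~ sample.

    reference: one row per CpG, one column per cell type (beta values).
    sample:    one beta value per CpG for the mixed (whole blood) sample.
    """
    n_loci, n_types = len(reference), len(reference[0])
    # Precompute R'R and R'y so each iteration only touches small matrices.
    gram = [[sum(reference[i][j] * reference[i][k] for i in range(n_loci))
             for k in range(n_types)] for j in range(n_types)]
    rhs = [sum(reference[i][j] * sample[i] for i in range(n_loci))
           for j in range(n_types)]
    # The trace bounds the largest eigenvalue of R'R, so this step is stable.
    step = 1.0 / sum(gram[j][j] for j in range(n_types))
    weights = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        grad = [sum(gram[j][k] * weights[k] for k in range(n_types)) - rhs[j]
                for j in range(n_types)]
        # Gradient step followed by projection onto the non-negative orthant.
        weights = [max(0.0, w - step * g) for w, g in zip(weights, grad)]
    return weights


# Hypothetical reference library: six CpGs by three cell types
# (columns: granulocytes, lymphocytes, monocytes).
reference = [
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
    [0.8, 0.2, 0.5],
    [0.2, 0.8, 0.5],
    [0.5, 0.5, 0.1],
]

# Simulated whole blood sample: 60% granulocytes, 30% lymphocytes, 10% monocytes.
true_props = [0.6, 0.3, 0.1]
sample = [sum(p * b for p, b in zip(true_props, row)) for row in reference]

estimated = estimate_cell_proportions(reference, sample)
print([round(w, 3) for w in estimated])
```

Because the simulated sample is an exact, noise-free mixture, the estimates recover the assumed proportions. With real array data, accuracy depends on how well the chosen CpGs discriminate between cell types and on the sensitivity of the assay, as noted above.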

Conclusions

Due to its convenience, the blood is commonly used in epigenomic studies, but its heterogeneous nature leads to interpretation difficulties. Many publications now report significant associations between DNA methylation and a variety of health conditions or exposures; typically, studies that have used genome-wide platforms have found associations indicative of immune response or inflammation, which almost certainly represent effects that are mediated by cell composition. Nevertheless, since many of these studies also report associations with other processes that are not easily explained as cell composition effects, the blood is still a valuable tissue to assay. Indeed, important biological insights may be gained from studying the cell composition effects themselves, not to mention their potential interactions with other processes.

Direct measurement of counts of various leukocyte cell types would be the ideal method for conducting such analysis; however, it is generally expensive and logistically difficult to measure these counts in a large study population beyond the standard CBC differential used clinically. Fortunately, there are statistical methods available for generating approximate cell proportions imputed from reference data; while these are necessarily inferior to directly measured counts, they may often represent an acceptable tradeoff between bias and feasibility.

Clearly, DNA methylation studies of blood tissue and of all other tissues should include a detailed consideration of the epigenomic features that are intrinsic to the cells that make up the tissue. Our knowledge of the differentially methylated loci in leukocyte subtypes, although far more extensive than even a few years ago, is still incomplete. This fact means that many DNA methylation associations attributed to intrinsic epigenetic processes are likely due to subtle effects on cell type composition involving cells that have yet to be characterized. Current human leukocyte libraries account for only approximately seven subtypes, but numerous specific subtypes may exist when one counts activation states of many common types [57, 58]. Memory versus naïve T or B cells, subsets of activated NK cells as well as different forms of dendritic and myeloid cells have yet to be studied. Current research is underway to characterize these types. As more data sets become available to characterize the epigenomic variation among different and less common functional subtypes of leukocytes, there will be an interest in applying these reference data in EWAS. However, success in such application will require the use of technologies that are more sensitive than the currently used microarray platforms, so we anticipate that in the future digital sequencing technologies will play a more prominent role in the conduct of such studies.