Introduction

Selection pressures from pathogens are known to impact on the worldwide geographic distribution of genetic variants in certain pathogen-response-associated genes. Such gene-specific effects could potentially lead to confounding by geographic disease associations. We wished to determine if such constraints impact on the genetic structure of more homogeneous populations.

The Irish population lies at the extreme of Europe, with some evidence that it lies at the end of a cline of gene frequencies across Europe.1 There is also evidence that clines of gene frequencies exist across Ireland. Anthropological studies in the 1950s showed that there was clear evidence of a gradient of genetic variation across Ireland; for example, freckle density is higher in the west of Ireland than in the east.2 Early serological studies of the ABO and Rh blood groups showed that the west of Ireland has the highest frequency of the blood group O in Europe (>75%), with the frequency of group A being highest in the east of the country (>30%). This eastern region corresponds well with the area of greatest settlement by the Anglo-Normans, in which population there is a high frequency of the A group.3 The frequency of Rh negative was found to be high in the east and lower in the west, ranging from 12 to 20%. However, a more recent study of genetic structure of the British Isles concluded that there was very little spatial genetic structure when studying several genetic systems including the ABO blood groups.4 A study of Y chromosomal variation also showed an east-west pattern, with haplogroup 1 (R1b3)5 at a frequency of 98.3% in the western-most Irish province Connaught, which declines easterly to 73.3% in Leinster.6 Therefore, it has been shown that for some systems there does appear to be population structure in Ireland, suggesting the need to test for such structure in any candidate gene investigation of disease in Irish populations. We set out to assess the geographic variation in Irish autosomal polymorphisms and whether such variation was greater for pathogen response genes.

One feature of Irish population structure has been the collapse of the population during the Great Irish Famine (1847–1851). Gene frequencies are more likely to be dominated by small bottlenecks during population foundation, rather than by the halving over 60 years caused by death and emigration. Nevertheless, it is of interest to investigate whether gene frequencies might be associated with the distribution of patterns of mortality during this extreme event. Some genetic variants have been shown to influence susceptibility to certain diseases that were prevalent during the Great Irish Famine, such as tuberculosis (TB).7 We therefore contrasted the allele frequencies of SNPs associated with TB response8, 9, 10 with documented mortality attributed to TB both during the Great Famine7 and in more reliable mortality figures from a later time point in the early 20th century,11, 12, 13 to investigate whether genetic background had a detectable influence on historic TB mortality.

Materials and methods

Subjects

There were 1600 patients in all, 1163 of whom participated in a study of the genetics of acute coronary syndromes (ACS)14 and have either suffered from myocardial infarction, unstable angina or stable angina, and 437 of whom are part of a study on the genetics of early-onset ACS who have suffered from either myocardial infarction or unstable angina before the age of 55 (males) or 60 (females). Subjects have provided information about the Irish county of origin of each of their grandparents. These data were used as the geographic variable, so any grandparents who came from outside Ireland or whose Irish county of origin was unknown were excluded. For example, if the origin of only two of the four possible grandparents was known, these data were used in the analysis. This reduced the data set to 5645 grandparents (Table 1).

Table 1 Description of each of the 23 Irish sample locations, indicating latitude, longitude, number of individuals and Fst values

Geographic information

Each of the 32 counties of Ireland, from both Northern Ireland and the Republic of Ireland, was represented in the data set. The latitude and longitude were obtained for the most central location of each of the counties (Table 1). Counties with low representation (fewer than 50 grandparents) were merged, creating a final data set of 23 sample locations. While this ascertainment procedure does not provide an unbiased estimate of allele frequencies in Ireland (since candidate genes may differ in frequency in this disease population compared to the general population), cardiovascular disease is at a high incidence in all counties, and therefore the differences between grandparental counties of origin are likely to be highly representative of the differences seen in the population as a whole.

SNPs and genotyping

Each patient has been genotyped for 25 different SNPs in genes that can be categorised into two groups: those selected as candidate genes that may modulate platelet-mediated thrombosis,15, 16, 17, 18, 19, 20, 21 and those where there is documented evidence in the literature that the SNP variant appears to modulate significantly responses to pathogens in population studies8, 9, 10, 22, 23, 24, 25, 26, 27, 28, 29 (Table 2).

Table 2 Details of the 25 SNPs

Genotyping was carried out using the Amplifluor™ method by K Biosciences (www.kbioscience.co.uk). As part of quality control, 87 duplicate samples were genotyped. The majority of the SNP assays had a 0% error rate. The HLA-A SNP had an error rate of 1.1%, the SULT1A1 SNP had an error rate of 2.3% and the TLR4 exon 3 T399I SNP had an error rate of 1.1%. This gives an overall error rate of 0.2%. Tests for Hardy–Weinberg equilibrium (HWE) were also carried out using the G statistic.
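The overall error rate can be checked with simple arithmetic. The sketch below assumes that each reported per-SNP rate corresponds to a whole number of discordant pairs among the 87 duplicates (1 of 87 is about 1.1%, 2 of 87 is about 2.3%); these counts are inferred from the percentages and are not stated in the text.

```python
# Assumed inputs: discordant duplicate pairs per SNP, inferred from the
# reported percentages (not given explicitly in the text).
DUPLICATES = 87
N_SNPS = 25
discordant = {"HLA-A": 1, "SULT1A1": 2, "TLR4 T399I": 1}


def error_rate(errors, comparisons):
    """Percentage of comparisons that were discordant."""
    return 100.0 * errors / comparisons


# Per-SNP rate: discordant pairs out of 87 duplicates for that assay.
per_snp = {snp: error_rate(n, DUPLICATES) for snp, n in discordant.items()}

# Overall rate: all discordant pairs out of 87 duplicates x 25 assays.
overall = error_rate(sum(discordant.values()), DUPLICATES * N_SNPS)

for snp, rate in per_snp.items():
    print(f"{snp}: {rate:.1f}%")
print(f"overall: {overall:.1f}%")
```

Under these assumed counts the per-SNP and overall figures round to the values quoted above.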

Statistical analysis

Fst

Fst is a standardised measure of the genetic variance among populations.30 It is a measure of heterozygote deficiency, reflecting the probability that two alleles drawn at random are identical relative to the combined population. Fst is proportional to the variance in allele frequencies between populations. It was calculated for each SNP over all sample locations, and a multilocus Fst was calculated for each sample location. The pathogen response SNPs as a group were compared to the thrombosis SNPs as a group, and the differences between the groups were assessed using the Kolmogorov–Smirnov test. Weir and Cockerham F statistics were estimated using FSTAT software31, 32 according to the following equation:

$$F_{st} = \frac{\delta_a}{\delta_a + \delta_b + \delta_w}$$

where δa is the among-sample variance component, δb is the between-individual within-sample variance and δw is the within-individual component. We used Mantel test statistics33 to evaluate the correlations between the multilocus Fst values of all SNPs with the corresponding distance in kilometres between each of the sample locations.
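As an illustration of the Mantel procedure, the following is a minimal sketch and not the FSTAT implementation. It assumes great-circle (haversine) distances between location centroids and a one-sided permutation P-value. The function names, the four coordinates and the Fst matrix are hypothetical values chosen for the example; they are not taken from Table 1.

```python
import math
import random


def fst_from_components(var_among, var_between, var_within):
    """Fst as the among-sample share of the total variance."""
    return var_among / (var_among + var_between + var_within)


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Off-diagonal upper-triangle entries, with rows/columns taken in `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(fst, dist, permutations=999, seed=0):
    """Correlate two distance matrices; permute locations in one to get P."""
    rng = random.Random(seed)
    ids = list(range(len(fst)))
    d = upper_triangle(dist, ids)
    r_obs = pearson(upper_triangle(fst, ids), d)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(ids)  # relabel locations in the Fst matrix only
        if pearson(upper_triangle(fst, ids), d) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Hypothetical example: four illustrative locations (latitude, longitude).
coords = [(53.35, -6.26), (51.90, -8.47), (53.27, -9.05), (54.60, -5.93)]
dist = [[haversine_km(*a, *b) for b in coords] for a in coords]

# Made-up pairwise multilocus Fst values, symmetric with a zero diagonal.
fst = [
    [0.000, 0.004, 0.005, 0.003],
    [0.004, 0.000, 0.006, 0.007],
    [0.005, 0.006, 0.000, 0.004],
    [0.003, 0.007, 0.004, 0.000],
]

r, p = mantel(fst, dist)
print(f"Mantel r = {r:.3f}, one-sided P = {p:.3f}")
```

Permuting the location labels of one matrix, rather than shuffling individual matrix cells, preserves the dependence among pairwise distances that share a location, which is the reason a Mantel test is used in place of an ordinary correlation test.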

Correspondence analysis

Correspondence analysis34 is a statistical ordination method that collapses the major source of correlated variation within a group of variables into an axis (CA axis 1), and similarly defines subsequent axes explaining any residual variation. Correspondence analysis was performed using ADE4 software for the R statistical package.35 Minor allele counts for each SNP in each sample location were used as the input.
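To make the ordination concrete, the sketch below extracts the first correspondence analysis axis from a small table of counts. It is not the ADE4 routine: it assumes the textbook formulation (standardised residuals of the count table, with axis 1 given by the leading singular vector) and uses power iteration so that only the Python standard library is needed. The count table is invented for illustration, and the function name is hypothetical.

```python
import math


def correspondence_axis1(counts, iterations=500):
    """First CA axis of a locations x SNPs table of minor allele counts.

    Returns (row coordinates on axis 1, axis 1 eigenvalue, total inertia).
    Assumes the table is not perfectly independent (non-zero inertia).
    """
    total = sum(sum(row) for row in counts)
    n_rows, n_cols = len(counts), len(counts[0])
    r = [sum(row) / total for row in counts]  # row masses
    c = [sum(counts[i][j] for i in range(n_rows)) / total for j in range(n_cols)]

    # Standardised residuals: (observed - expected) / sqrt(expected), as proportions.
    s = [[(counts[i][j] / total - r[i] * c[j]) / math.sqrt(r[i] * c[j])
          for j in range(n_cols)] for i in range(n_rows)]
    inertia = sum(x * x for row in s for x in row)

    # Power iteration on S^T S to find the leading right singular vector.
    v = [float(j + 1) for j in range(n_cols)]
    for _ in range(iterations):
        u = [sum(s[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(s[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    u = [sum(s[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    eigenvalue = sum(x * x for x in u)  # squared leading singular value
    coords = [u[i] / math.sqrt(r[i]) for i in range(n_rows)]  # principal coordinates
    return coords, eigenvalue, inertia


# Invented example: minor allele counts for 3 locations (rows) x 3 SNPs (columns).
counts = [[30, 10, 25], [25, 15, 12], [10, 30, 20]]
axis1, eigenvalue, inertia = correspondence_axis1(counts)
print(f"axis 1 explains {100 * eigenvalue / inertia:.0f}% of the inertia")
print("location scores on axis 1:", [round(x, 3) for x in axis1])
```

The sign of an axis is arbitrary, so only the relative positions of locations along it are meaningful. The share of inertia explained by an axis is the quantity reported as the percentage of variance for each axis.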

Spatial autocorrelation

The spatial autocorrelation coefficient36 Moran's I was used to test whether the allele frequencies at one location are independent of the allele frequencies at neighbouring locations within specified distance classes. The significance of the spatial autocorrelation (SA) correlograms was corrected for multiple tests.37 Spatial autocorrelation analysis was performed using PASSAGE software.38 For 1-D correlograms, Moran's I is plotted against distance classes. 2-D (Windrose) correlograms take into account compass bearings as well as distance and are used to see if the data are anisotropic. The five distance classes for the 1-D analysis had upper limits of 84, 127, 167, 219 and 385 km. The five distance classes for the 2-D analysis had upper limits of 40, 85, 160, 265 and 400 km.
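A 1-D correlogram can be sketched as follows. This is an illustrative calculation and not the PASSAGE implementation: it assumes binary weights (a pair of locations counts only if its distance falls inside the class) and omits significance testing. The allele frequencies and positions are invented; only the distance-class upper limits are taken from the text.

```python
def morans_i(values, dist, lower, upper):
    """Moran's I for pairs whose distance d satisfies lower < d <= upper.

    Returns None when no pair of locations falls in the distance class.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0
    n_pairs = 0  # each unordered pair is counted twice (i, j) and (j, i)
    for i in range(n):
        for j in range(n):
            if i != j and lower < dist[i][j] <= upper:
                cross += dev[i] * dev[j]
                n_pairs += 1
    if n_pairs == 0:
        return None
    return (n / n_pairs) * cross / sum(d * d for d in dev)


# Invented example: four locations on a line, 50 km apart, with a frequency gradient.
positions_km = [0, 50, 100, 150]
freqs = [0.10, 0.12, 0.20, 0.22]
dist = [[abs(a - b) for b in positions_km] for a in positions_km]

# Upper limits of the five 1-D distance classes given in the text.
limits = [84, 127, 167, 219, 385]
correlogram = []
lower = 0
for upper in limits:
    correlogram.append(morans_i(freqs, dist, lower, upper))
    lower = upper

for upper, value in zip(limits, correlogram):
    label = "no pairs" if value is None else f"{value:.3f}"
    print(f"up to {upper} km: I = {label}")
```

In this toy gradient the nearest class gives a positive I and the more distant classes give negative values, which is the signature of a cline. With so few locations the statistic is unstable and can fall outside the range of −1 to 1, so the example shows only the mechanics.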

TB data

Three of the pathogen response SNPs have been shown in other populations to modulate susceptibility to TB. These are the SLC11A1 SNP,10, 39 the VDR FokI SNP8, 40 and the IL12RB1 SNP.9 Death rates from TB during the famine year of 1847 on a per county basis obtained from the census of 18717 were used to look for any correlations between such death rates and frequency of the susceptibility alleles of the SNPs. As such data have been shown to be unreliable,12, 13 they are only analysed as an exploratory exercise. We also investigated the more reliable death rates on a county basis obtained in the Irish Registrar General's decennial summaries for the years 1901–1910.11 This report detailed death rates from several different forms of TB, as well as death rates from all forms of TB per county. Correlations were sought between the allele frequencies of each of the three variants and the death rates from TB per county per 100 000 of population in the years 1901–1910. Stepwise multiple regression analysis41 considered whether the TB mortality rates per county correlated with allele frequency of the three SNPs, considering latitude, longitude, NW-SE and SW-NE directional trends and correspondence analysis axis 1 as covariates (successively dropping terms from the model, with P=0.10 as the cutoff for retaining terms).

Results

Hardy–Weinberg equilibrium

The two SNPs that appear in significant Hardy–Weinberg disequilibrium are the ABO SNP and the HLA-A SNP. To determine if this could be accounted for by regional variation among sample locations, a local HWE test was summed over the 23 locations (23 d.f.). The HLA-A test was no longer significant, implying that the variant is at equilibrium once regional variation was taken into account. The ABO variant was still highly significant: this was not simply a consequence of small expectations for some calculations, since it remained significant (P=0.001) when calculated using Yates' correction. The ABO genotype frequencies are 85.2% GG and 14.8% GA in the overall population. There were no subjects who were homozygous for the minor allele A. It is possible that there is further unmeasured substructure for this variant in the Irish population, or that some other factor is impacting on HWE. Independent genotyping by a different method would be required to confirm this.
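The departure of the ABO variant from equilibrium can be illustrated with a single-population G-test. The sketch below is an assumption-laden example and does not reproduce the reported analysis: the genotype counts are hypothetical, obtained by applying the reported percentages to an assumed 1600 genotyped subjects, and the test is the pooled one-degree-of-freedom version, not the sum over 23 locations. No continuity correction is applied.

```python
import math


def hwe_g_test(n_major_hom, n_het, n_minor_hom):
    """G-test of Hardy-Weinberg proportions for one biallelic SNP.

    Returns (G, P), with P from a chi-square distribution on 1 d.f.
    (three genotype classes, minus one, minus one estimated allele frequency).
    """
    n = n_major_hom + n_het + n_minor_hom
    q = (n_het + 2 * n_minor_hom) / (2 * n)  # minor allele frequency
    p = 1.0 - q
    observed = [n_major_hom, n_het, n_minor_hom]
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    # Classes with an observed count of zero contribute nothing to G.
    g = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    # Survival function of chi-square with 1 d.f.: P(X > g) = erfc(sqrt(g / 2)).
    p_value = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return g, p_value


# Hypothetical counts: 85.2% GG, 14.8% GA and no AA among an assumed 1600 subjects.
g, p_value = hwe_g_test(1363, 237, 0)
print(f"G = {g:.2f}, P = {p_value:.2g}")
```

With these assumed counts, roughly nine AA homozygotes would be expected under equilibrium and none is observed, which is what drives the large G value.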

Fst values

The average Fst value for the 25 SNPs was 0.004, which is very low, given that the average Fst value calculated using more widely sampled populations is greater than 0.120.42, 43 This indicates little variation in allele frequencies among the 23 sample locations. Pathogen response SNPs tend to have higher Fst values (mean 0.004, SE ±0.0007) than the thrombosis SNPs (mean 0.003, SE ±0.0003), but this difference is not significant (P=0.210).

Multilocus Fst values across pairs of sample locations for the 25 SNPs over the 23 sample locations were calculated, and the means for each county are shown in Table 1. It is noticeable that the counties containing larger urban settlements (Antrim, Dublin, Galway, Limerick, Cork) have a low mean Fst (0.004–0.005), consistent with a homogenisation of gene frequencies in urban areas. Certain counties (Offaly, Wexford) have a high mean Fst (0.008–0.009). This may relate to substantial and persistent patterns of settlement in the Wexford area after the Anglo-Norman invasions of the 12th century,44 and to the settlement of Offaly from England and Scotland by 17th century plantations44 and/or a low rate of migration between these counties and other areas. There were very small and non-significant correlations between the pairwise multilocus Fst values and the distance in kilometres between sample locations for all SNPs (r=0.05, P=0.279), for the pathogen response SNPs (r=0.068, P=0.241) and for the thrombosis SNPs (r=−0.014, P=0.504).

Correspondence analysis

Correspondence analysis displays the relationship between the 23 sample locations and between the 25 SNPs simultaneously. For ease of visualisation, the SNPs have been displayed separately (Figure 1). The first axis accounts for 16% of the variance and the second axis accounts for 12% of the variance. Axes 3, 4 and 5 accounted for 11, 10 and 9% of the variance, respectively.

Figure 1

Correspondence analysis of the 25 SNPs. Axis 1 is proportional to darkness of shading in Figure 2a and axis 2 is proportional to the darkness of shading in Figure 2b.

We plotted the values of the first, second and third axes for the 23 sample locations on a map of Ireland. These maps show some general trends (Figure 2a–c), with axis 1 showing an NE-SW trend and axis 2 showing an NW-SE trend, although with notable exceptions. The first axis indicates that the south-western county, Kerry, may represent the most distinctive extreme, with the eastern seaboard (Wexford, Dublin and Antrim) lying at the other extreme. Wexford, consistent with this county having a slightly more distinct genetic make-up, showed sharp contouring with neighbours in each plot and dominated the third axis, consistent with its relatively high Fst value (Table 1).

Figure 2

Map of correspondence analysis. (a) Axis 1 values, showing an approximate NE-SW gradient. (b) Axis 2 values, showing an approximate NW-SE gradient. (c) Axis 3 values.

On both the first and second axes of Figure 1, there are three pathogen response SNPs that stand apart from the rest. They are the CXCL12 SNP, the HLA-A SNP and the ABO SNP, which are all at quite low minor allele frequencies in the population.

Spatial autocorrelation analysis

Figure 3 displays the 1-D correlogram for the average of the 25 SNPs, for the 15 pathogen response SNPs and for the 10 thrombosis SNPs. The low Moran's I values (largest I=0.05, smallest I=−0.1 on a scale of –1 to 1) indicate low levels of spatial autocorrelation, either positively or negatively. This indicates lower genetic population structure than seen in a study of the British Isles, which itself only showed minor spatial genetic structure.4

Figure 3

1-D spatial autocorrelograms using the Moran's I statistic. Distances indicated are upper limits of the distance classes. —•— Average of 25 SNPs; —▪— average of 15 pathogen response SNPs; —— average of 10 thrombosis SNPs.

Pathogen response SNPs showed a slightly stronger correlation between nearby regions, although the patterns observed for each group of SNPs were largely the same (Figure 3).

The main pattern to emerge across all SNPs from the 2-D autocorrelation was largely the same in both the pathogen response and other SNPs (Figure 4a–c). The general pattern is positive spatial autocorrelation in a northwest to southeast direction and negative spatial autocorrelation elsewhere, indicating a general northeast to southwest gradient of allele frequencies across Ireland. This is consistent with the northeast to southwest trend seen in the first axis of the correspondence analysis. While there are some differences, overall there is no strong indication from this analysis of a radically different pattern in pathogen response variants compared to the others.

Figure 4

(a) Average 2-D spatial autocorrelogram of the 25 SNPs. (b) Average 2-D spatial autocorrelogram of the 15 pathogen response SNPs. (c) Average 2-D spatial autocorrelogram of the 10 thrombosis SNPs. In each segment, the average value for I is shown. Inner annulus represents 0–43 km, second annulus represents 43–172 km and third annulus represents 172–387 km. Black segments are positively spatially autocorrelated (0<I<1) and white segments are negatively spatially autocorrelated (−1<I<0).

Test for correlation between allele frequencies and TB deaths

Using linear regression analysis, there is a significant association between the east-west gradient (longitude) and TB deaths from 1901–1910 (P=0.017) and also for TB deaths during the famine year of 1847 (P<0.001). No statistically significant correlations were found between the TB deaths from 1901–1910 and any of the allele frequencies at the county level (α=0.05). The correlations with TB deaths were weak for the SLC11A1 allele (r=−0.29, P=0.18), the VDR FokI allele (r=0.07, P=0.75) and the IL12RB1 allele (r=−0.23, P=0.29). There is a significant correlation between TB deaths during the famine year of 1847 and the SLC11A1 C allele (r=0.42, P=0.05), although this is not significant for either of the other two SNPs (P=0.11 for VDR FokI and P=0.59 for IL12RB1). Stepwise regression analysis considered whether TB deaths from 1901–1910 were influenced by the frequencies of the minor allele for any of the three SNPs, also considering correspondence analysis axis 1 and directional trends as covariates in the model. This found no significant predictors for the TB death rates from 1901–1910, although correspondence analysis axis 1 (P=0.018) and correspondence analysis axis 3 (P=0.001) are strong predictors of TB deaths during the famine year of 1847 (Figure 5b), which may largely reflect the association of certain genetic backgrounds with densely populated eastern areas.

Figure 5

(a) Death rates from TB from 1901–1910 from the registrar generals' report. (b) Death rates from TB during the famine year of 1847. Black: 400–500 deaths per 100 000; dark grey: 300–400 deaths per 100 000; light grey: 200–300 deaths per 100 000; white: <100 deaths per 100 000.

The analysis considered up to four counties for each person, which are clearly not independent. For the data exploration aspects of this study, this does not introduce any particular bias. For the hypothesis testing aspects, an appropriate statistical correction would be required. However, since none of the P-values was significant, no such adjustments are necessary.

Discussion

This analysis shows that pathogen response variants, which we anticipated might show a stronger population structure than other variants, show a trend towards slightly stronger structure, but that this trend is very weak and not significant. Geographic trends appear very similar for the two groups of SNPs. The overall pattern of genetic variation is suggestive of a trend distinguishing the SW and midlands from the north and east, but this is not a dominating pattern, since a secondary weaker independent trend from NW to SE is indicated from the correspondence analysis. Much larger sample sizes of subjects will be required to define such trends more accurately. One county, Wexford at the southeastern tip, has a distinct pattern, perhaps consistent with its history as the earliest entry point (1169 AD) for the substantial and persistent migrations from Britain, which are a major characteristic of Irish historical demography.45 Some indication of the durability of the early Wexford settlement can be found in the persistence of a distinctive early English dialect from the time of the Anglo-Norman invasion until the middle of the last century.46

How informative are the grandparental origins in relation to genetic structure of Ireland? The industrial revolution had a lesser impact in Ireland than in the UK, with a predominantly rural agricultural economy well into the mid-20th Century. Census data indicate that of the 8.1 million strong population in 1841, only some 400 000 were living outside the counties in which they were born, with most migration to the urban centres of Dublin and Belfast.47 Therefore, the choice of grandparental counties of origin as an estimate of allelic origin should capture a significant proportion of the genetic structure in the Irish population, outside of the main urban regions.

A previous study has shown that the intron 4 G/C variant in the SLC11A1 gene is associated with TB susceptibility. While we found that this SNP was associated with TB mortality in the year 1847 during the Great Irish Famine, this association is not significant after correction for multiple testing. It may be that TB deaths in the Famine and in later historic periods investigated were dominated by nongenetic factors.

Our study indicates that selection pressures on pathogen response modulating variants have not created a markedly different population structure within a homogeneous Caucasian population. A separate issue is whether the allele frequencies themselves have been influenced by selection. Our analysis cannot answer this question, except to note that the impact of such selection pressures, if they exist, has not been confined to a particular geographic area.

What are the implications for problems of confounding genetic associations with disease through population structure? In general, it appears that selection may play quite a weak role in altering allele frequencies of common polymorphisms within a homogeneous population. The Irish may represent a good population to study weak genetic risks conferred by genes involved in disease responses with low danger of confounding through population structure. Although the population structure is small, it is detectable, and this supports proposals that such structure should be corrected for in analysis of weak genetic associations with disease.48 Even if some random genes showed marginally less structure than those genes such as HLA-A2, which show slightly greater structure, the trend will be in the same direction, and a reasonable evaluation of the data will allow for confounding caused by genetic substructure.