Strengths and limitations
To our knowledge, the new linked resource is unique, providing unprecedented population size and statistical power to study the effects of elements in soil on human health. The data provide comprehensive, prospective recording of health outcomes across a population of over 6 million individuals, offering, in principle, the potential to study the effects (whether adverse or beneficial) of any soil constituent present in the linked dataset on the risk of any medical condition diagnosed by or reported to primary care practitioners.
The additional health care and lifestyle details recorded in EMRs provide us with the ability to adjust for a wide range of potential confounding factors which may cluster geographically, as do the prior linkages of the THIN database to measures of area-level socioeconomic status, air pollution, and land use. The wide range of soil constituent measures we have linked will permit adjustment for the presence of other elements which may also modify the risk of outcomes of interest, and enable us to assess the extent of effect modification due to the presence of elements which may affect bioavailability (as in the case of iron and arsenic) [13].
The similarity of the soil constituent exposure levels observed among THIN patients to those that would be expected in the wider population suggests that studies using the linked resource are likely to produce generalizable results. Previous validation studies of the THIN database indicate that participants are representative of the population at large in terms of a range of sociodemographic measures [36].
There are a number of limitations that may affect the utility of the linked database in practice. The sampling resolution of the surveys in G-BASE may conceal focal areas of high variability in soil constituents. Local heterogeneity is generally greater in urban areas, but this is superimposed upon systemically increased concentrations associated with the impact of urbanisation for elements such as lead and copper [37, 38]. In urban areas, where the THIN population is concentrated, sampling density is high (4 per km²) and work carried out during the completion planning for G-BASE suggests that improvements in estimate precision above the 1 per 2 km² level may be relatively small [39], although this will vary from element to element.
Uncertainties always exist in the interpolation of values between points of measured concentration to make predictions at unsampled locations. We used the inverse distance weighting method as it is a relatively straightforward and widely understood approach that produces estimates primarily determined by the closest available sample site. Point estimates at the postcode centroid (rather than an alternative such as an average of all points within a postcode) were considered sufficient as UK postcode areas are small, especially relative to the distance between sampling sites: in urban areas each typically represents a small section of a street (or even a single large apartment building) and contains an average of 15 (range 1–100) individual mail delivery addresses [40]. More sophisticated techniques (such as those based on machine learning) which incorporate information from additional mapping layers have been shown to improve precision in subsets of the G-BASE data [41]; however, this is an ongoing area of research and such methods have not yet been applied or validated across the full survey area.
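As an illustration of the inverse distance weighting approach described above, the following minimal Python sketch estimates a concentration at a single postcode centroid from nearby sample sites. It is not the code used for the linkage: the function name `idw_estimate`, the distance exponent of 2, the optional `k` nearest-site cut-off, and the coordinates and concentrations in the usage example are all assumptions made for illustration, since the text does not state the parameters that were used.

```python
import math

def idw_estimate(target, samples, power=2.0, k=None):
    """Inverse distance weighted estimate at `target`, an (x, y) point.

    samples: list of ((x, y), value) tuples in the same projected
             coordinate system as `target` (e.g. metres).
    power:   distance exponent; 2 is a common default (assumed here).
    k:       if given, use only the k nearest sample sites (assumed option).
    """
    by_distance = []
    for (x, y), value in samples:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            # The centroid coincides with a sample site: use it directly.
            return value
        by_distance.append((d, value))
    if not by_distance:
        raise ValueError("at least one sample site is required")
    by_distance.sort(key=lambda dv: dv[0])
    if k is not None:
        by_distance = by_distance[:k]
    # Weights fall off with distance, so the nearest site dominates.
    weights = [1.0 / d ** power for d, _ in by_distance]
    weighted_sum = sum(w * v for w, (_, v) in zip(weights, by_distance))
    return weighted_sum / sum(weights)

# Hypothetical sample sites (coordinates in metres, values in mg/kg).
sites = [((0.0, 0.0), 10.0), ((1000.0, 0.0), 20.0), ((0.0, 2000.0), 40.0)]
centroid = (250.0, 0.0)
print(idw_estimate(centroid, sites))
```

Because the weights fall off with the square of the distance in this sketch, the estimate at the centroid stays close to the value at the nearest site, which matches the property described in the text.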
We cannot be certain that raised levels of a contaminant in the soil where a patient lives translate directly into increased exposure for that patient. Where patients work a long distance from home, eat little locally grown produce, seldom engage in outdoor activities such as sports or gardening, or live in focal areas of severe contamination, the true exposure level may be markedly different. The presence of a substantial number of such individuals in the THIN population would tend to introduce random error in the assigned exposure. This would typically manifest as a bias towards the null, so whilst it is unlikely to lead to the false identification of an increased risk, the magnitude of a true risk might be underestimated. The large size of the THIN population (and concomitant statistical power) will reduce the impact of such bias on our ability to detect raised risks, even in cases where we are unable to accurately quantify them.
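The attenuating effect of random exposure error can be illustrated with a small simulation. The sketch below is a deliberately simplified assumption rather than an analysis of THIN data: it uses a continuous outcome and a linear model (real analyses of disease risk would use other models), standard normal exposures, and classical measurement error that is independent of the outcome. All names and parameter values are hypothetical.

```python
import random

def ols_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def simulate_attenuation(n=20000, beta=1.0, error_sd=1.0, seed=0):
    """Compare the slope on true exposure with the slope on mismeasured exposure."""
    rng = random.Random(seed)
    # True individual exposure (unobserved in practice).
    true_exposure = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # The outcome depends on the true exposure plus unrelated noise.
    outcome = [beta * x + rng.gauss(0.0, 1.0) for x in true_exposure]
    # Assigned exposure = true exposure + random error unrelated to outcome.
    assigned = [x + rng.gauss(0.0, error_sd) for x in true_exposure]
    return ols_slope(true_exposure, outcome), ols_slope(assigned, outcome)

slope_true, slope_assigned = simulate_attenuation()
print(f"slope using true exposure:     {slope_true:.2f}")
print(f"slope using assigned exposure: {slope_assigned:.2f}")
```

Under these assumptions the slope estimated from the assigned exposure shrinks towards zero by roughly the factor var(true) / (var(true) + var(error)), which is 0.5 with the defaults above. The association is underestimated, but its direction is preserved.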
The participation rate among practices in the Yorkshire and Humber SHA was low, which may restrict our ability to draw inferences about the risks experienced by patients in this area. In addition, there is a known bias towards arable land within the NSI(XRFS) sample collection areas (the survey was initially carried out to help assess agricultural potential). This issue primarily affects West Wales, where known examples of industrial land contamination are not detectable in the NSI(XRFS) dataset [37, 42]. We are unable to distinguish between different compound forms of the elements included in the linkage, which may be problematic where toxicity or effects on bioavailability differ [43]. For example, different forms of iron are known to differentially affect the bioavailability of arsenic in soils [13]. It should, however, be possible to adjust for this at least partially at area level; whilst we do not know the exact locations of patients or practices in the linked dataset, we do know to which Strategic Health Authority area each practice belongs, and the ratios between ironstones and other mineral forms of iron differ substantially between these areas [44].
Whilst the THIN data are longitudinal, the G-BASE data are (although collected over an extended period) effectively cross-sectional, and the linkage has been carried out at a single point in time. The exposure levels assigned to each individual may not, therefore, be representative over the entire duration of follow-up. Previous research suggests that levels of most of the soil constituents included in the linkage are driven by (generally slow) geological processes and that levels are relatively stable over time, except in areas and for elements where there are significant ongoing inputs from industrial or agricultural activities [45].
The linked measures are unlikely to accurately reflect long-term exposure for patients who have only been registered for a short time; however, it should be possible to address this issue by carrying out sensitivity analyses restricted to patients who have been continuously registered for an extended period. In addition, THIN is updated quarterly, so the duration of follow-up available for the patients included in the linkage will increase over time. Movement of patients away from (and registration of new participants into) participating practices will, over time, reduce the proportion of patients for whom soil measures are available, requiring the linkage to be repeated. The patients who leave the database are more likely to be those in highly mobile sociodemographic groups, somewhat reducing the demographic representativeness of the linked population, but at the same time preferentially removing those for whom point estimates of exposure are least likely to reflect lifetime exposure.
When linking geospatial and medical datasets there is, in each case, a need to make compromises in order to preserve patient confidentiality. The THIN/G-BASE linkage demonstrates a viable approach that provides high quality, individual-level data on a very large number of patients at the cost of limiting our knowledge of patient locations and the number of geochemical variables we were able to link (to avoid producing unique combinations which would make postcodes and patients readily identifiable). It is unlikely that linkages providing spatial information in sufficient detail for risk mapping and GIS analysis, or that incorporate richer information about soils (e.g., more constituents, or details of other soil characteristics that may influence exposure or bioavailability) would receive ethical approval in most jurisdictions unless either explicit patient consent were obtained (limiting the feasibility of assembling a large research population), or summary data on population health were used in place of individual patient records. This situation may improve in the near future, however, as emerging techniques for secure multi-party statistical analysis [46] may enable multiple data-holders to carry out rich joint analyses without explicitly linking or sharing their datasets with one another and creating confidentiality concerns in the process.
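The concern about unique combinations of geochemical values can be made concrete with a simple check. The sketch below is hypothetical: the text does not describe how disclosure risk was assessed for the linkage, and the function name `risky_combinations`, the banding of values, and the thresholds `k` and `band_width` are assumptions chosen purely for illustration.

```python
from collections import defaultdict

def risky_combinations(postcode_values, k=5, band_width=10.0):
    """Return banded value combinations shared by fewer than k postcodes.

    postcode_values: dict mapping a postcode identifier to a tuple of
                     linked soil measurements (non-negative numbers).
    k, band_width:   illustrative thresholds, not values from the study.
    """
    groups = defaultdict(list)
    for postcode, values in postcode_values.items():
        # Coarsen each measurement into a band so similar values group together.
        banded = tuple(int(v // band_width) for v in values)
        groups[banded].append(postcode)
    # A combination held by very few postcodes could identify them.
    return {combo: pcs for combo, pcs in groups.items() if len(pcs) < k}

# Hypothetical postcodes with two, then three, linked measurements each.
two_vars = {"PC1": (12.0, 48.0), "PC2": (15.0, 41.0), "PC3": (88.0, 5.0)}
three_vars = {"PC1": (12.0, 48.0, 3.0), "PC2": (15.0, 41.0, 27.0), "PC3": (88.0, 5.0, 9.0)}
print(risky_combinations(two_vars, k=2))
print(risky_combinations(three_vars, k=2))
```

In this toy example, adding a third variable separates the two postcodes that previously shared a combination, which illustrates why linking more geochemical variables increases the risk that a postcode becomes identifiable.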