We first provide a general outline of the methodology that we have developed to investigate diagnosis delays in HIV. We initially fit a model to biomarker data as a function of time since infection in a ‘calibration’ dataset of seroconverters, in whom we have strong information regarding the date of infection; this is done in order to characterise the ‘natural history’ of the biomarkers in untreated patients. Using this fitted model, we can make inferences regarding the timing of infection in a seroprevalent patient given their observed biomarker data and date of diagnosis. In order to do this, we also need to consider whether we can make any prior assumptions regarding the likely infection date before looking at the biomarker data; one simple approach is to assume that the date of infection is equally likely at any point in time from the legal age of sexual consent until diagnosis (termed a ‘uniform prior distribution’). However, we further develop a method that explicitly models the average diagnosis delay within a group of patients using a survival distribution. All of this is done within a Bayesian framework.
Biomarker models
The model for longitudinal observations of pre-treatment CD4 counts follows the structure described by Stirrup et al. [19]. Briefly, CD4 counts are modelled on the square-root scale, using a statistical model that includes random intercept and slope components, independent measurement error terms and a fractional Brownian motion stochastic process component. An interlinked model for pre-treatment VL measurements (on log10-scale) is used, based on that proposed by Pantazis et al. [20, 21]. The proportion of ambiguous nucleotide calls at first treatment-naïve viral sequence is modelled using a zero-inflated beta model. This effectively comprises a logistic regression model for the occurrence of no ambiguous calls and a model for a beta-distributed variable amongst those cases with any ambiguous calls observed.
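As an illustration of this two-part likelihood, the sketch below gives a minimal zero-inflated beta log-density; the parameters pi0 (probability of observing no ambiguous calls) and the beta shape parameters a and b are illustrative stand-ins for the model's actual regression-driven parameterisation:

```python
import math

def zi_beta_logpdf(y, pi0, a, b):
    """Log-density of a zero-inflated beta model: probability pi0 of
    observing exactly zero, otherwise a Beta(a, b) density on (0, 1).
    Corresponds to a logistic model for 'no ambiguous calls' combined
    with a beta model for the observed proportion when it is non-zero."""
    if y == 0.0:
        return math.log(pi0)
    # log of the Beta(a, b) normalising constant via log-gamma
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return (math.log(1.0 - pi0) + log_norm
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))
```

The zero-inflation component carries information because absence of ambiguity is typical of recent infection, whilst the beta component captures the tendency of the ambiguity proportion to rise with time since infection.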
CD4, VL and sequence ambiguity are all modelled in terms of the ‘true time elapsed from date of infection’ in each patient. For those patients in whom this is not known exactly, this variable is formed as the sum of the ‘time from diagnosis to observation’ and an unobserved latent variable representing the delay from infection to diagnosis (denoted τi for the ith patient). For the calibration dataset, τi is given a uniform prior distribution over an interval equal to the time between the last negative and first positive HIV-1 tests in each patient; for seroprevalent patients, two different options for the prior are considered: a uniform prior, or a prior implicit in a joint model for HIV incidence and delay to diagnosis.
We are interested in epidemiological analysis on a scale of months and years, and so do not distinguish between dates of infection and seroconversion. Further model and computational details are given in Additional file 1: Appendix A.
Individual patient predictions with uniform priors
The biomarker model fitted to the calibration dataset is used to generate distributions for the delay to diagnosis in seroprevalent patients. We approximate the posterior distribution for all of the biomarker model parameters resulting from the calibration dataset using a multivariate normal distribution, and then use this as the prior for these model parameters in subsequent analyses for new patients.
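This moment-matching step can be sketched as computing the mean vector and covariance matrix of the calibration posterior draws, which then define the multivariate normal prior used in the subsequent per-patient models (plain lists of parameter vectors stand in here for the actual posterior sample format):

```python
def mvn_moments(draws):
    """Moment-match a multivariate normal to posterior draws: returns
    the sample mean vector and (unbiased) sample covariance matrix.
    draws is a list of equal-length parameter vectors."""
    n = len(draws)
    p = len(draws[0])
    mean = [sum(d[j] for d in draws) / n for j in range(p)]
    cov = [[sum((d[j] - mean[j]) * (d[k] - mean[k]) for d in draws) / (n - 1)
            for k in range(p)] for j in range(p)]
    return mean, cov
```

This approximation discards any non-Gaussian features of the calibration posterior, but avoids refitting the full calibration model for every new patient.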
When evaluating the delay to diagnosis in each individual new seroprevalent patient, we initially use a uniform prior distribution for this latent variable (τi), defined between zero and an upper limit equal to the time elapsed between the patient’s 16th birthday (or 1st Jan. 1980, whichever is later) and the date of their HIV diagnosis. The model for the observed CD4 counts, VL measurements and sequence ambiguity in each new patient depends on the value of τi as for the model fitted to the calibration dataset, although the range of possible values is wider. Information regarding the probable diagnosis delay is obtained by generating the posterior distribution of τi for each patient given their observed biomarker data. We employ this approach to generate predictions for one patient at a time (i.e. separate statistical models are generated and processed for each patient, although this can be run in parallel for cohorts of patients using cluster computers).
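The logic of this per-patient inference can be illustrated with a grid approximation for a single biomarker: a uniform prior on τi combined with a Gaussian likelihood for one square-root-scale CD4 observation. The linear mean function and all parameter values here are illustrative stand-ins for the calibration model (which in the actual analyses is a joint model sampled with Stan):

```python
import math

def tau_posterior_grid(obs_sqrt_cd4, t_obs, upper, intercept, slope, sigma,
                       n_grid=2000):
    """Grid approximation to the posterior of the diagnosis delay tau,
    under a uniform prior on [0, upper], given one square-root-scale
    CD4 observation taken t_obs years after diagnosis. Expected
    sqrt-CD4 is modelled as intercept + slope * (time since infection)."""
    taus = [upper * (k + 0.5) / n_grid for k in range(n_grid)]
    log_w = []
    for tau in taus:
        time_since_infection = tau + t_obs
        mu = intercept + slope * time_since_infection
        log_w.append(-0.5 * ((obs_sqrt_cd4 - mu) / sigma) ** 2)
    # normalise on the log scale for numerical stability
    m = max(log_w)
    weights = [math.exp(v - m) for v in log_w]
    total = sum(weights)
    probs = [w / total for w in weights]
    post_mean = sum(t * p for t, p in zip(taus, probs))
    return taus, probs, post_mean
```

For example, with an expected sqrt-CD4 of 25 at infection declining by 1.5 per year, a lower observed CD4 count shifts the posterior for τi towards longer delays, as intended.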
Survival models for delay to diagnosis
In making population-level inferences, there is a problem that some patients have little biomarker data available or have biomarker values that only provide limited information regarding the timing of infection. We address this issue through the fitting of an exponential survival model for diagnosis following HIV infection. This approach enables information to be pooled across similar patients, and also allows direct investigation of patient characteristics associated with the delay to diagnosis in cases of HIV. In these analyses the approximate multivariate normal prior distribution for biomarker parameters resulting from the calibration dataset is used as previously described, but data from the entire subgroup of interest of newly observed seroprevalent patients are combined in a single statistical model.
The event time in the survival models fitted is defined as the time from HIV infection to diagnosis, once again specified as an unobserved latent variable (τi) with value restricted to lie between zero and an upper limit equal to the time difference between the patient’s 16th birthday (or 1st Jan. 1980) and the date of their HIV diagnosis. However, the prior distribution of τi is implicit in a statistical model for HIV incidence and diagnosis. As when using a uniform prior distribution for τi, biomarker data for each seroprevalent patient are modelled in terms of the true time elapsed from the date of infection, and this allows a posterior distribution for the delay to diagnosis to be obtained that is conditional on this information.
We can, of course, only include patients in whom HIV has been diagnosed, and so there is no censoring of survival times. However, for a cohort of patients diagnosed in any given calendar period there is both left and right truncation of the event times. In this setting it is also necessary to model the incidence rate of new HIV infections in the population of interest. We define the start and end of the study period as TL and TR, respectively, and denote the point in calendar time of HIV infection in the ith patient as ti. The left truncation results from the fact that any given patient can only be included in the cohort if TL < ti + τi; the right truncation results from the fact that a patient will only be observed if ti + τi < TR. This situation is directly analogous to the problem of estimating the distribution of incubation time from transfusion-acquired HIV infection to AIDS, an important issue at the start of the HIV epidemic, in which there was left truncation of observations due to a lack of recording of very early AIDS cases, and right truncation due to the fact that transfusion events leading to HIV infection could only be identified retrospectively upon the development of AIDS [22]. We develop our model for the incidence rate of new HIV cases and the delay-to-diagnosis distribution based on the work of Medley et al. [23, 24] in this previous context, and we use notation based on that employed by Kalbfleisch and Lawless [25].
Following Medley et al. [23, 24] and Kalbfleisch and Lawless [25], initiating events (i.e. HIV infections) occur according to a Poisson process for which the rate of new events is a function of time; in technical terms, we define an intensity function for the process h(x; α), x > −∞, where x is a variable representing calendar time and the intensity function h(x) is determined by the parameter vector α. We assume that the delay to diagnosis τ is independent of the time of infection x, with cumulative distribution function F(τ) and density function f(τ) = dF(τ)/dτ. Medley et al. [23, 24] and Kalbfleisch and Lawless [25] considered the situation at the start of an epidemic, with observation of diagnoses at any point in time up to the end of the analysis (i.e. the period (−∞, TR]) and the first non-zero incidence at a defined point in time (set to 0). However, we are interested in modelling populations in later stages of the HIV epidemic and so only consider diagnoses occurring within a defined period [TL, TR], without specifying a start time for the epidemic. We do not consider the possibility that a new HIV infection is never diagnosed (e.g. due to death before diagnosis), but believe that the proportion of such cases would be very small in the population of interest. The joint log-likelihood (ℓ) function for the incidence and observation of HIV cases, omitting dependence on model parameters, is then:
$$\begin{array}{*{20}l} \ell &= \sum_{i=1}^{n} \left\{\log \left(h\left(x_{i} \right) \right) + \log \left(f\left(\tau_{i} \right) \right) \right\} - A, \\ \text{where,}\ A&= \int_{-\infty}^{T_{R}} h\left(x \right) \left\{ F\left(T_{R}-x \right) - F\left(T_{L}-x \right) \right\} dx, \end{array} $$
$$\begin{array}{*{20}l} &= \int_{-\infty}^{T_{L}} h\left(x \right) \left\{ F\left(T_{R}-x \right) - F\left(T_{L}-x \right) \right\} dx \\&\quad+ \int_{T_{L}}^{T_{R}} h \left(x \right) F\left(T_{R}-x \right) dx. \end{array} $$
This matches the form of the expression used previously [23–25], but the integral denoted A is adjusted to reflect the truncated observation window and the lack of assumptions regarding the start of the epidemic.
It is noted by Kalbfleisch and Lawless [25] that the absolute incidence of new infections can be eliminated from the joint likelihood function by conditioning on the total number of cases observed. However, it is still necessary to model the relative incidence as a function of calendar time unless constant incidence can be assumed at all points up until the end of the study period. The assumption of constant incidence might be justified for a completely stable endemic disease, but this condition is not common in the epidemiology of infectious diseases.
We are primarily interested in fitting a model for the delay-to-diagnosis distribution, but in doing so we are therefore required to model the incidence of new infections prior to and during the calendar period under investigation. Ideally, the function for the incidence of new HIV cases, h(x), would be chosen so as to provide a plausible representation of the entire epidemic. However, when attempting to fit models to data from patients diagnosed decades after the start of the epidemic, this is not a practical objective. Instead, we propose a pragmatic approach in which the incidence h(x) is assumed to be either exponentially increasing or decreasing prior to the calendar period of interest (i.e. for x < TL), and to be either constant or in a separately defined state of exponential change during the period itself (i.e. for TL < x < TR). We therefore define the incidence rate function as either:
$$\begin{array}{*{20}l} \text{1:} &h\left(x \right) = e^{\left(c + \delta_{1} \left(x \right) b\left(x-T_{L} \right) \right)}, \text{ or} \\ \text{2:} &h\left(x \right) = e^{\left(c + \delta_{1} \left(x \right) b\left(x-T_{L} \right)+ \delta_{2} \left(x \right) d\left(x-T_{L} \right) \right)}, \end{array} $$
where the function δ1(x) = 1 if x < TL and 0 otherwise, δ2(x) = 1 if x > TL and 0 otherwise, and c, b and d are model parameters: exp(c) is the incidence rate at TL, b determines the rate of decay (b < 0) or growth (b > 0) of incidence prior to this, and d (in ‘Option 2’) determines the change in incidence after TL. For an exponential model for the delay-to-diagnosis distribution with rate parameter λ, and for b + λ > 0, the integral required for the log-likelihood function can be solved analytically in each case (results in Additional file 1: Appendix B).
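As a concrete reading of the two options, the piecewise intensity can be written directly as a small function; the parameter values in the usage below are illustrative only:

```python
import math

def incidence_rate(x, c, b, d, T_L, option=1):
    """Piecewise-exponential incidence intensity h(x) for Options 1 and 2:
    exp(c) is the rate at T_L, b governs exponential growth/decay before
    T_L, and d (used only under Option 2) governs the change after T_L."""
    if x < T_L:
        return math.exp(c + b * (x - T_L))
    if option == 2:
        return math.exp(c + d * (x - T_L))
    return math.exp(c)  # Option 1: constant incidence during the window
```

Both options are continuous at TL, since the δ1 and δ2 terms vanish there and the rate equals exp(c).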
The functions that we have suggested for h(x) clearly cannot provide a full description of the HIV epidemic. However, we propose that allowing for an increasing or decreasing trend in HIV incidence directly prior to the period of interest will appropriately adjust for truncation of diagnosis dates, as long as the function h(x) provides an adequate description across the probable range of infection dates of the patients included in the analysis. The first option presented assumes constant incidence of new HIV infections during the observation period, which may be appropriate for short analysis windows, whilst the second option also allows a change in incidence following the start of the observation period. Further computational details are given in Additional file 1: Appendix B.
Datasets and software used
We present analyses that make use of viral sequences of the protease and reverse transcriptase regions of the pol gene collected by the UK HIV Drug Resistance Database [26] that can be linked to pseudo-anonymised clinical records of patients enrolled in the UK Collaborative HIV Cohort (UK CHIC) [27] and UK Register of Seroconverters (UKR) cohort [2]. The statistical methodology was developed using a ‘calibration’ dataset comprising 1299 seroconverter patients from the UKR cohort who can be linked to a treatment-naïve partial pol sequence. All patients included from the UKR cohort have an interval between last negative and first positive HIV tests of less than 1 year, and some patients were identified during primary infection, meaning that their date of infection can be treated as fixed and known. Injecting drug users were excluded from the analysis.
The methodology developed was then applied to a seroprevalent cohort of men who have sex with men (MSM) diagnosed with HIV in London over a 5-year period spanning 2009–2013 and enrolled in the UK CHIC study. We only included men aged at least 18 years at the time of diagnosis with a treatment-naïve partial pol sequence stored in the UK HIV Drug Resistance Database. We also excluded any men enrolled in the UKR study. This led to a sample size of 3521 patients. Pre-treatment CD4 counts and VL measurements were included in the analysis, but were not considered as part of the inclusion criteria.
We employ a fully Bayesian approach, implemented in the Stan probabilistic programming language [28]. We carried out all Bayesian modelling using a Linux cluster computer, although fitting individual models using a modern desktop computer would be feasible. The authors acknowledge the use of the UCL Legion High Performance Computing Facility (Legion@UCL), and associated support services, in the completion of this work. Maximum likelihood estimation of random-effects models was performed using the lme4 package for R; these models were used in the CD4 back-estimation of infection dates performed for comparison.
Simulation analyses
To further investigate the properties of the methodology developed, we carried out several simulation analyses. Firstly, we generated data for 2000 hypothetical patients with unknown date of HIV infection, without considering the truncation of observation times. For this purpose we set distributional parameters equal to the posterior mean values obtained when our model was fitted to the calibration dataset without the inclusion of lab-specific random effects; to further simplify matters, data were only generated for white MSM with subtype-B HIV acquired at the age of 32. The delay from infection to diagnosis was set to follow an exponential distribution with rate parameter of 0.5 (on the scale of years). Nucleotide ambiguity proportions were simulated at the time of diagnosis, and CD4 counts and VL measurements were generated at the time of infection, after 1.5 months, 3 months, and subsequently at 6-month intervals from 6 months to 3 years. If a negative CD4 count was generated, then this and all subsequent simulated clinic visits were censored; this meant that a few simulated patients were excluded completely, and so a new patient was generated to replace each of them. The limit of detection for VL was set to 50 copies/mL in the simulations. We initially used a uniform prior distribution for time of infection (from the patient’s 16th birthday to date of HIV diagnosis) when generating predictions, and also fitted an exponential survival model for the delay to diagnosis pooled across simulated patients. Simple CD4 back-estimation based on a fitted ‘random intercepts and slopes’ model was used for comparison [8].
Two additional simulation analyses were carried out with time-varying incidence and truncation of observation times. Patients were generated with characteristics, delays to diagnosis and scheduled sets of viral sequence, CD4 and VL observations as described for the simulation without truncation. However, incidence was varied over calendar time, and only those patients whose simulated date of diagnosis fell within a specified analysis window were selected for analysis. Models were fitted with estimation of both the rate-of-diagnosis parameter (λ) and the incidence rate over calendar time, allowing the latter to vary before and during the analysis window. Firstly, a simulated cohort was generated with incidence increasing from zero over the 10 years prior to the analysis window, and a constant incidence rate of 200/year during the analysis window of 5 years’ duration. Secondly, a simulated cohort was generated with a constant incidence rate of 300/year for 10 years, followed by a decrease to 150/year over the 5 years prior to the analysis window and a further decrease to 100/year over the 5 years of the analysis window itself.
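The truncated-cohort setup can be illustrated with a small simulator: infections arrive as a Poisson process (here at a constant rate, for simplicity; the cohorts above also used time-varying rates), delays to diagnosis are exponential, and only diagnoses falling inside [TL, TR] enter the cohort. All parameter values below are illustrative only:

```python
import random

def simulate_truncated_cohort(rate_inc, lam, T_start, T_L, T_R, seed=1):
    """Simulate a truncated cohort: infections occur as a Poisson process
    at rate rate_inc/year from T_start, delays to diagnosis are
    Exponential(lam), and only patients diagnosed inside (T_L, T_R)
    are retained (left and right truncation of event times)."""
    rng = random.Random(seed)
    cohort = []
    t = T_start
    while True:
        t += rng.expovariate(rate_inc)   # Poisson-process infection times
        if t >= T_R:
            break                        # later infections cannot be diagnosed in-window
        tau = rng.expovariate(lam)       # delay from infection to diagnosis
        if T_L < t + tau < T_R:
            cohort.append((t, tau))
    return cohort
```

Because long delays are preferentially excluded by the right-truncation condition, the retained delays are systematically shorter than the generating distribution, which is precisely the bias that the truncation-adjusted likelihood corrects for.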