Data
Diabetes patients were identified through the Piedmont Diabetes Registry among the residents of the North-Western Italian city of Turin (population: 896,914). The Registry is based upon anonymous record linkage between administrative databases, lists of exemptions from payment of drugs, hospital discharge records and prescription databases. Details on the identification of the population-based cohort are described elsewhere [14, 15]. Italian citizens, irrespective of social class or income, are cared for by general practitioners, and health care services are supplied by the National Health System (NHS). All drug prescriptions, outpatient treatments, diabetes-related prescriptions of medical devices (such as test strips, syringes, and glucometers), hospital discharges and emergency room admissions are recorded by the Regional NHS Administrative Databases. Data registered from August 1st, 2003 to July 31st, 2004 were linked to the overall Turin population, making it possible to analyze health care services used by patients with and without diabetes (n = 33,792 and n = 863,122, respectively). As previously described, we analyzed reimbursement tariffs set by regional and national government contracts [14]. In the present study, data were used for tutorial purposes only and for their distributive characteristics. An update of the data was not included in the study aims.
The Piedmont Diabetes Registry is authorized to use administrative health care data for epidemiological purposes. Raw anonymous data are available upon request to the Authors.
Cost analyses
Effect of diabetes on mean annual NHS expenditure was analyzed over the entire cohort with several multivariable models, adjusted for age and gender.
First, we fitted one-part models (Table 1), including: 1) an ordinary least squares (OLS) regression; 2) a lognormal linear regression model; and 3) a generalized linear model (GLM) with gamma distribution.
Table 1
Determinants of annual healthcare costs, mean annual predictions and cost ratios (patients with vs. without diabetes), Root Mean Squared Errors (RMSE), by several data modeling approaches

| Model | Diabetes estimate | 95% CI | Mean prediction, with diabetes (€) | 95% CI | Mean prediction, without diabetes (€) | 95% CI | Cost ratio | RMSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| One-part models | | | | | | | | |
| Normal (€) | 1,832.76 | 1,795.56–1,869.95 | 3348.6 | 3343.8–3353.9 | 831.2 | 829.8–832.4 | 4.03 | 3,342.4 |
| Lognormal^b (exp β) | 6.0 | 5.84–6.16 | 6146.5 | 6116.9–6178.6 | 1343.6 | 1340.5–1347.0 | 4.57 | 3,670.0 |
| Gamma (exp β) | 2.6 | 2.56–2.67 | 3878.1 | 3867.0–3891.1 | 826.1 | 824.8–827.3 | 4.69 | 3,351.1 |
| Two-part models | | | | | | | | |
| Part 1 | | | | | | | | |
| Logistic (OR) | 2.40 | 2.18–2.64 | | | | | - | |
| Part 2 | | | | | | | | |
| Normal (€) | 1,710.36 | 1,668.40–1,752.32 | 3392.0 | 3387.2–3397.5 | 1058.8 | 1057.4–1060.4 | 3.20 | 3,732.2 |
| Lognormal (exp β) | 3.3 | 3.21–3.32 | 4119.9 | 4104.4–4136.3 | 1175.2 | 1173.2–1177.6 | 5.60 | 3,760.6 |
| Gamma (exp β) | 2.2 | 2.21–2.28 | 3700.1 | 3690.0–3711.1 | 1050.8 | 1049.5–1052.4 | 3.50 | 3,735.6 |
| Two-part model (logistic + gamma) | | | 3662.26 | 3652.07–3673.25 | 891.9 | 890.63–893.54 | 4.10 | 3,739.8 |
The OLS model relies on the central limit theorem, whereby the mean of a sufficiently large sample is approximately normally distributed, independently of the population distribution. It assumes a linear relationship between cost accumulation and its possible determinants (such as sex, age, type of diabetes, etc.), with an additive effect of the covariates (that is, cost is a function of the Variable 1 effect plus the Variable 2 effect plus the Variable 3 effect, etc.) and a normal distribution of the error term. As OLS regression is well known and easy to apply, it is attractive for researchers and widely employed. However, in the presence of skewness in the distribution of the error terms, OLS is not robust and can yield inaccurate standard errors and confidence intervals. To overcome the problem of skewness in the residuals, a commonly adopted approach is to model a log-transformation of the response variable (that is, costs), which can achieve reasonable normalization even with highly skewed data. To obtain results in natural units (euros, dollars), this approach requires a back-transformation when interpreting results. Such back-transformation might cause several additional problems, which are partially avoided by using specific statistical approaches (such as the “smearing” estimator) [16]. In this analysis we applied the Duan smearing estimate [17], that is, the average of the exponentials of the residuals from the OLS regression on the log-transformed costs. If $c_i$ ($i = 1, \ldots, n$) is the cost observed for patient $i$, $x_j$ ($j = 1, \ldots, h$) are the $h$ covariates and $\beta_j$ are the corresponding regression coefficients estimated with the OLS method, the smearing factor ($\Phi$) is:
$$ \Phi = \frac{1}{n} \sum_{i=1}^{n} \exp\left( \log\left(c_i\right) - x_i \hat{\beta} \right) $$
where $x_i \hat{\beta}$ denotes the fitted value of the log-transformed cost for patient $i$.
The exponentials of the predicted values were then multiplied by the smearing factor to obtain expected values on the original scale.
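The smearing retransformation described above can be sketched in a few lines. The original analyses were run in SAS; the Python below is only an illustration under stated assumptions: a single binary covariate, strictly positive costs, and hypothetical toy data. The function names `fit_simple_ols` and `duan_smearing_predictions` are introduced here and do not come from the study.

```python
import math

def fit_simple_ols(x, y):
    """Closed-form OLS for y = b0 + b1 * x with a single covariate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def duan_smearing_predictions(x, costs):
    """Fit OLS on log(cost) and retransform with Duan's smearing factor.

    Assumes every cost is strictly positive (no zero costs).
    """
    log_costs = [math.log(c) for c in costs]
    b0, b1 = fit_simple_ols(x, log_costs)
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [lc - f for lc, f in zip(log_costs, fitted)]
    # Smearing factor: mean of the exponentiated log-scale residuals.
    phi = sum(math.exp(r) for r in residuals) / len(residuals)
    # Expected cost on the original scale: phi * exp(fitted log cost).
    predictions = [phi * math.exp(f) for f in fitted]
    return phi, predictions

# Hypothetical toy data: 1 = diabetes, 0 = no diabetes; costs in euros.
diabetes = [0, 0, 0, 0, 1, 1, 1, 1]
costs = [120.0, 300.0, 80.0, 2500.0, 900.0, 1500.0, 400.0, 9000.0]
phi, predicted = duan_smearing_predictions(diabetes, costs)
```

Because OLS residuals from a model with an intercept average to zero, the smearing factor is never below 1; simply exponentiating the fitted log costs would therefore underestimate the mean cost.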
Moreover, in the present analysis the log-transformed outcome variable was (cost + 1), as we also needed to include in the model those subjects who had zero costs: they could not simply be treated as missing cases, as they might convey relevant information on the distribution of costs among subgroups. It is common practice to add a constant to null values when fitting log-linear regression models, in order not to exclude subjects from the analysis [12]. This is an arbitrary choice that could bias the relationship between cost and covariates. However, a sensitivity analysis showed that the distributions of original and transformed costs, stratified by the covariates used in the models, substantially overlap (data not shown).
GLMs are a generalization of the linear model that specifies the relationship between a dependent variable and a set of predictor variables and allows the response variable to have a distribution other than the normal [18, 19]. GLMs permit flexible modelling of covariates and enable inference to be made directly about mean costs, rather than relying on transformation methods. The relationship between the covariates and the mean of the dependent variable is described by the link function. The family specifies the distribution (such as normal, gamma, Poisson, etc.) that reflects the mean-variance relationship. As in most previous cost analyses, we used a gamma distribution with a log link, which performs satisfactorily with distributions with zeroes and/or long right tails [13]. The lognormal model and the log-link GLM are both multiplicative models, because they are expressed on a logarithmic scale. This means that, due to the algebraic properties of logarithms, after retransformation to the original scale cost is a function of the product of the exponentiated covariate effects. Consequently, the estimated costs cannot be compared directly, because they depend on the scale of the model and on the retransformation technique used.
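To make the multiplicative structure explicit, consider a log-link model for the expected cost with a diabetes indicator $d$ and the adjustment covariates age and sex. The symbols $\beta_0$, $\beta_d$, $\beta_a$ and $\beta_s$ are notation chosen here for illustration:

$$ E\left[c \mid d, \mathrm{age}, \mathrm{sex}\right] = \exp\left(\beta_0 + \beta_d d + \beta_a \mathrm{age} + \beta_s \mathrm{sex}\right) = e^{\beta_0} \, e^{\beta_d d} \, e^{\beta_a \mathrm{age}} \, e^{\beta_s \mathrm{sex}} $$

Holding age and sex fixed, the ratio of expected costs with versus without diabetes is therefore

$$ \frac{E\left[c \mid d = 1, \mathrm{age}, \mathrm{sex}\right]}{E\left[c \mid d = 0, \mathrm{age}, \mathrm{sex}\right]} = e^{\beta_d} $$

so an exponentiated coefficient is read as a cost ratio rather than as a difference in euros. The same reading applies to the lognormal model when a single smearing factor multiplies all predictions, since that factor cancels in the ratio.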
In the second group of analyses (two-part models, Table 1), the presence of zero costs was handled by means of a two-part model [20]: i) in the first part, a logistic regression was used to model the probability of incurring any cost over the one-year period. The dependent variable was set equal to 1 for any subject who incurred costs, and to 0 for any subject who incurred no costs. We also included covariates to adjust for age, sex and presence of diabetes. Odds ratios (ORs) for the probability of using health services (i.e. of having non-zero costs) were then estimated; ii) the second part estimated the total accumulated costs, conditional on incurring any cost, by using the same set of three models applied in the one-part model group and described above. In this two-part set of analyses, the lognormal linear regression did not require adding 1 to the observed costs.
Cost ratios of patients with diabetes vs. those without diabetes were then estimated. Finally, estimated costs for patients with and without diabetes were calculated by multiplying the expected probability of spending for health care by the estimated costs for people using health care (results shown only for the gamma model).
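The combination of the two parts can be illustrated with a minimal Python sketch. It is not the authors' SAS code and covers only the unadjusted special case: with a group indicator as the sole covariate, the logistic part reproduces the observed proportion of patients with any cost, and the second part reproduces the mean cost among those patients. The data and the function name `two_part_expected_cost` are hypothetical.

```python
def two_part_expected_cost(costs):
    """Unadjusted two-part estimate of the expected cost for one group."""
    positive = [c for c in costs if c > 0]
    # Part 1: probability of incurring any cost.
    p_any = len(positive) / len(costs)
    # Part 2: mean cost conditional on incurring any cost.
    mean_positive = sum(positive) / len(positive)
    # Expected cost = P(cost > 0) * E[cost | cost > 0].
    return p_any, mean_positive, p_any * mean_positive

# Hypothetical annual costs in euros.
with_diabetes = [0.0, 850.0, 1200.0, 4300.0, 2650.0]
without_diabetes = [0.0, 0.0, 150.0, 600.0, 90.0, 0.0, 1260.0, 300.0]

p_d, m_d, e_d = two_part_expected_cost(with_diabetes)
p_n, m_n, e_n = two_part_expected_cost(without_diabetes)
cost_ratio = e_d / e_n
```

In this unadjusted case the product equals the plain group mean, zeroes included. The two parts give different answers from a one-part model only once covariates such as age and sex enter each part separately.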
Due to the uncertainty about the parametric assumptions on the distributional forms, confidence intervals were calculated using bootstrapping, a data-based simulation method for assigning measures of accuracy to statistical estimates. It can be used to produce inferences such as confidence intervals without knowing the type of distribution from which a sample has been taken. The bootstrap simulates what would happen if repeated samples of the population could be taken, by constructing a number of resamples of the observed dataset, drawn with replacement. Standard errors of the parameter of interest are then estimated by the standard deviation of the parameter across the simulated samples. We extracted bootstrap samples using a SAS System macro, generating 100 bootstrap random samples of patients [21].
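The resampling procedure can be sketched as follows. This is a Python illustration, not the SAS macro used in the study. It bootstraps the mean cost of a hypothetical sample with 100 resamples and reports the bootstrap standard error together with a percentile interval. The percentile interval is one common choice and an assumption here, since the text does not state which interval type was used. The function name `bootstrap_mean_cost` and the seed are likewise illustrative.

```python
import random
import statistics

def bootstrap_mean_cost(costs, n_resamples=100, seed=12345):
    """Bootstrap standard error and 95% percentile interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(costs)
    estimates = []
    for _ in range(n_resamples):
        # Resample patients with replacement, keeping the sample size.
        resample = [costs[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(resample) / n)
    # Standard error: standard deviation of the bootstrap estimates.
    se = statistics.stdev(estimates)
    # 39 cut points at multiples of 2.5%: first = 2.5th, last = 97.5th.
    cuts = statistics.quantiles(estimates, n=40)
    return se, (cuts[0], cuts[-1])

# Hypothetical, right-skewed annual costs in euros.
costs = [0.0, 0.0, 45.0, 120.0, 300.0, 310.0, 800.0, 1500.0, 4200.0, 9800.0]
se, (lower, upper) = bootstrap_mean_cost(costs)
```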
To assess the performance of each model, the root mean square error (RMSE) was computed. RMSE is a frequently used measure of the difference between values predicted by a model and values actually observed, provided the models are expressed in the same unit of measure, as in our case [20]. These individual differences are also called residuals, and the RMSE serves to aggregate them into a single measure of predictive power. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. An RMSE value closer to 0 is desirable.
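As a minimal illustration of the RMSE computation (Python, with hypothetical observed and predicted costs; the function name `rmse` is introduced here):

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical annual costs in euros; the last prediction misses by 1,000.
observed = [0.0, 100.0, 250.0, 4000.0]
predicted = [50.0, 150.0, 200.0, 3000.0]
value = rmse(observed, predicted)

# Mean absolute error, for comparison with the RMSE.
mae = sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)
```

In this toy example three of the four errors are only 50 euros, yet the single 1,000-euro error pushes the RMSE well above the mean absolute error, which is the weighting of large errors mentioned above.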
All the analyses were conducted using the SAS System.