The outcomes of part two of the statistical questionnaire are summarized in the following three subsections covering study design, statistical methods and the presentation of results.
Study design
A number of key themes emerged from the analysis of the questionnaire data. Foremost amongst these were the description of the study design, identification of the experimental unit, details of the sample size calculation, the handling of missing data and blinding for subjective measures. These topics are discussed individually below.
The experimental unit is a physical object which can be assigned to a treatment or intervention. In orthopaedics research, it is often an individual patient. However, other possibilities include things such as a surgeon, a hip or a knee. The experimental unit is the unit of statistical analysis and, for simple study designs, it is synonymous with the data values, i.e. there is a single outcome measure for each experimental unit. For more complex designs, such as repeated measures, there may be many data values for each experimental unit. Failure to correctly identify the experimental unit is a common error in medical research, and often leads to incorrect inferences from a study [1, 8].
The experimental unit was not identified correctly in 23% (15–33%; 95% confidence interval based on normal approximation to binomial) of the sampled studies. Of the 77 papers that correctly identified the experimental unit, 86% (75–92%) correctly summarised the data by patient. By far the most common reason for incorrect identification of the experimental unit was confusion between limbs and individual patients when analysing and reporting results. For example, one paper reported data for 100 patients but summarised outcomes for 120 feet, whereas another reported patient pain scores after surgery for both left and right ankles on some patients and single ankles for other patients. Failure to identify the correct experimental unit can lead to ‘dependencies’ in data. For example, outcome measures made on left and right hips for the same patient will be correlated, but outcome measures between individual hips from two different patients will be uncorrelated. Only one paper, where data were available from one or both legs for patients, identified this as an important issue and the authors decided to use what we would regard as an inappropriate strategy by taking the mean of the two values as the outcome for bilateral patients. Almost all of the statistical analyses reported in these studies (e.g. t-tests, ANOVA, regression) are based on an assumption that outcome data (formally the residuals) are uncorrelated; if this is not the case then the reported inferences are unlikely to be valid.
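The cost of ignoring such dependencies can be illustrated with a minimal sketch. It uses the standard 'design effect' approximation for equally sized clusters, which is an illustrative device and not a calculation taken from any of the surveyed papers; the function name, the number of patients and the within-patient correlation of 0.5 are all hypothetical.

```python
def effective_sample_size(n_patients, obs_per_patient, icc):
    """Approximate number of independent observations when each patient
    contributes several correlated measurements (e.g. two hips).

    icc is the assumed within-patient (intraclass) correlation; the value
    used below is hypothetical and chosen purely for illustration.
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    # Design effect for equally sized clusters: 1 + (m - 1) * icc
    design_effect = 1 + (obs_per_patient - 1) * icc
    return n_patients * obs_per_patient / design_effect


# 50 bilateral patients give 100 hips, but far fewer independent data points
print(round(effective_sample_size(50, 2, icc=0.5), 1))
```

With uncorrelated hips (icc = 0) the 100 measurements count in full, whereas with perfectly correlated hips (icc = 1) they carry no more information than 50 patients. Treating hips as independent therefore overstates the precision of the estimates.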
The size of the sample used in a study, i.e. the number of experimental units (usually patients), largely determines the precision of estimates of study population characteristics such as means and variances. That is, the number of patients in the study determines how confidently we can draw inferences from the results of that study and use them to inform decisions about the broader population of patients with that particular condition or problem. In clinical trials, a pre-study power analysis is usually used to estimate the sample size [9], although methods are available for many other study types [10]. It is particularly important for RCTs, where specific null hypotheses are tested, that a clear description of the methodology and rationale for choosing a sample size is given. For example, a description might state that the outcome is assumed to be normally distributed, that treatment group differences will be assessed using a t-test, that the power to detect a defined clinically meaningful difference is set to at least 80% and that the type I error rate, or significance level, is set to 5%.
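The ingredients listed above are enough to reproduce a sample size calculation. The following is a minimal sketch using the usual normal-approximation formula for comparing two means with equal group sizes; it is not the method of any particular paper in the survey, and the function name and the values of the difference and standard deviation are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate number of patients per arm for a two-sided comparison
    of two means (normal approximation, equal allocation).

    delta: clinically meaningful difference to detect (hypothetical value)
    sd:    assumed common standard deviation of the outcome
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = std_normal.inv_cdf(power)           # required power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Detect a 5-point difference when the outcome has a standard deviation of 10
print(n_per_group(delta=5.0, sd=10.0))  # 63 per arm
```

A calculation based on the t distribution rather than the normal approximation would give a slightly larger number, which is one reason why papers should state exactly which assumptions and method were used.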
The sample size was not justified in the Methods section for 19% (7–39%) of the 27 papers describing RCTs. A specific calculation, with sufficient details to allow the reader to judge the validity, was not given for 30% (14–50%) of RCTs. These studies often simply described the sample size in vague terms, for instance "…based on a priori sample size estimation, a total of 26 patients were recruited…". For 3 papers reporting RCTs, the validity of the sample size calculation was questionable, for 3 papers there was a lack of clearly stated assumptions and in 2 papers the calculation was simply not credible. For example, one paper gave sparse details about the population variance, minimum clinically important difference and required power, which resulted in a recruitment target of 27 patients for a two arm trial. For purely practical reasons one would always want an even number of patients in a two arm trial. In another paper, 400 patients were recruited to a study, based on a vague description of how this number was arrived at, and exactly 200 patients were randomly allocated to each of two treatment groups. A cynical reader might question the likelihood of such an exact split of patients between treatment groups; there is only a 1 in 25 chance of an exact split for a simple 1 to 1 randomization. However, this might simply be a case of poor reporting, where in reality blocking or minimization was used to equalise numbers in the treatment arms, thus giving more credence to the description of the design. For the 73 observational studies, only 34% (24–46%) justified the sample size, that is, there was some discussion in the paper of how the sample size was arrived at; this was often minimal, for instance a simple statement that the number of patients required to answer the research question was the number of patients who were available at the time of study, or those who accepted an invitation to participate (e.g. "…all patients were invited to join the study…").
Missing data are observations that were intended to be made but were not made [11]; the data may be missing for unexpected reasons (e.g. patient withdrawal from a study), or intentionally omitted or not collected. It is important to carefully document why data are missing in the study design when reporting. If data can be considered to be missing at random, then valid inferences can still be made. However, if values are missing systematically, then it is more dangerous to draw conclusions from that study. For example, if in a clinical trial comparing different types of hip replacement all of the missing data occur in one particular arm of the trial, the remaining data are unlikely to be representative of the overall result in that group of patients; the data may be missing because those patients went to another hospital for their revision surgery.
Data were missing, either for a complete unit or a single observation, in 34% (25–44%) of the papers; of these 34 papers, only 62% (44–77%) documented and explained the reasons for this. An audit of the data reported in each paper allowed the statistical assessors to identify 13 papers (13% of the total sample) where data were missing with no explanation. Missingness was generally inferred from the numbers reported in the results being smaller than those reported in the methods, with no explanation or reason offered by the authors of the study. Of the 34 papers reporting missing data, 28 based the analysis on complete cases, 2 imputed missing data and, for the remaining 4 papers, it was unclear what methodology was used.
(iv) Subjective assessments and blinding
Many orthopaedic studies report subjective assessments, such as a pain or a functional score after surgery or a radiological assessment of the quality of a scan. To reduce the risk of bias for these kinds of assessments it is desirable, where possible, to ‘blind’ the assessor to the treatment groups to which the patient was allocated.
Subjective assessments were undertaken in 16 of the 27 RCTs (59%; 95% CI 39–77%) and in 6 of these studies (38%; 95% CI 16–64%), the assessments were not done blind and no explanation was given as to why this was not possible.
Statistical methods
Statistical methods should always be fully described in the methods section of a paper and only the statistics described in the methods should be reported in the results section. In 20% (13–29%) of the papers in our survey, statistical methods not previously stated in the methods section were reported in the results section [2]. In addition to the poor reporting of the methods used, a number of specific issues were identified.
The most commonly reported statistical methods were chi-squared (χ²) and Fisher’s exact tests (47%; 95% CI 37–57%), t-tests (45%; 95% CI 35–55%), regression analysis (33%; 95% CI 24–43%) and Mann–Whitney tests (28%; 95% CI 20–38%). The selection of an appropriate method of analysis is crucial to making correct inferences from study data.
In 52% (32–71%) of papers where a Mann–Whitney, Wilcoxon rank sum or Wilcoxon signed rank test was used, the analysis was considered to be inefficient and the reported analysis was only considered to be correct 70% (50–86%) of the time. The t-test was used inappropriately, with a lack of robustness, in 26% (14–41%) of papers and in an equivalent proportion of papers (26%; 95% CI 14–41%) it was reported in such a way as to be irrelevant to the stated aims of the paper. This lack of relevance was, on occasion, due to method selection such as the choice between a parametric and a nonparametric test, but more often was simply a result of poor reporting and lack of clarity in the description. Many papers reported a list of the statistical tools used in the analysis, but in the results gave only short statements such as “A was better than B (p = 0.03)” with no details as to which test was used to obtain the p-value; so-called ‘orphan’ p-values [12]. It was therefore impossible to assess whether the correct test was used for the relevant comparison.
Seven papers (7%; 95% CI 3–14%) reported clear methodological errors in the analysis. Two papers wrongly used the Wilcoxon signed-rank test to compare independent samples and another paper used an independent samples t-test where a paired test should have been used. One paper committed the reverse error of using a paired t-test to compare cases and controls in an unpaired case–control study and another paper used a t-test to compare differences in proportions rather than, for instance, a χ² test. Another study calculated the arithmetic mean of a number of percentages, all based on different denominator populations. Finally, one study outlined reasons for conducting a non-parametric analysis in the methods only to later report an analysis of covariance, a parametric method of analysis based on assumptions of normality.
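The error of averaging percentages with different denominators is easy to demonstrate numerically. The sketch below uses made-up counts (they are not data from any surveyed paper) to show how far the unweighted mean of percentages can stray from the percentage obtained by pooling the underlying counts.

```python
def mean_of_percentages(groups):
    """Unweighted mean of per-group percentages (the approach criticised above)."""
    percentages = [100 * events / total for events, total in groups]
    return sum(percentages) / len(percentages)


def pooled_percentage(groups):
    """Percentage computed from the pooled numerator and denominator."""
    events = sum(e for e, _ in groups)
    total = sum(t for _, t in groups)
    return 100 * events / total


# Hypothetical (events, denominator) pairs: 1 of 10 and 90 of 100
groups = [(1, 10), (90, 100)]
print(mean_of_percentages(groups))          # 50.0
print(round(pooled_percentage(groups), 1))  # 82.7
```

The unweighted mean gives the group of 10 the same influence as the group of 100, which is rarely what the authors intend.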
(ii) Parametric versus non-parametric tests
Parametric statistical tests assume that data come from a probability distribution with a known form. That is, the data from the study can be described by a known mathematical model; the most widely used being the normal distribution. Such tests make inferences about the parameters of the distribution based on estimates obtained from the data. For example, the arithmetic mean and variance are parameters of the normal distribution measuring location and spread respectively. Non-parametric tests are often used in place of parametric tests when the assumptions necessary for the parametric method do not hold; for instance the data might be more variable or more skewed than expected. However, if the assumptions are (approximately) correct, parametric methods should be used in preference to non-parametric methods as they provide more accurate and precise estimates, and greater statistical power [13].
Many of the papers in this survey showed no clear understanding of the distinction between these types of tests, evidenced by reporting that made no statistical sense: e.g. "…continuous variables were determined to be parametric using Kolmogorov-Smirnov tests…", "…the t-test was used for parametric variances…", "…non-parametric statistics were used to compare outcome measures between groups (one way ANOVA)…" and "…Student's t-test and the Mann–Whitney test were used to analyse continuous data with and without normal distribution…". Continuous variables may be assumed to be approximately normal in an analysis, but it makes no sense to describe variables or variances as parametric. It is also incorrect to label an analysis of variance (ANOVA) as non-parametric. In at least 5 papers (5%; 95% CI 2–12%), the authors opted to use non-parametric statistical methods, but then summarised data in tables and figures using means and standard deviations, the parameters of the normal distribution, rather than correctly using medians and ranges or inter-quartile ranges.
The survey showed that 52% (42–62%) of papers used non-parametric tests inefficiently; that is, they reported the results of non-parametric tests for outcomes that evidence from the paper suggested were approximately normal. Three papers (3%; 95% CI 0–9%) compared the lengths of time to an outcome event between groups by using the non-parametric Mann–Whitney (M-W) test based on converting the times to ranks. By doing this, much of the information about real differences between individual records is lost; for example, outcomes of 1 day, 2 days and 100 days become 1, 2 and 3 when converted to ranks. Although times are often positively skewed, they are usually approximately normally distributed after logarithmic transformation [14]. A more efficient analysis can therefore usually be achieved by using a t-test on log-transformed times rather than applying an M-W test to untransformed data. This is not to say that non-parametric tests should never be used, but that for many variable types (e.g. times, areas, volumes, ratios or percentages) there are simple and well-known transformations that can be used to make the data conform more closely to the assumptions required for parametric analysis, such as normality or equality of variances between treatment groups.
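The loss of information from ranking, and the alternative of a logarithmic transformation, can be seen in a few lines. This is only an illustrative sketch built on the 1, 2 and 100 day example from the text; the helper name is hypothetical and the ranking function assumes no tied values.

```python
from math import log


def ranks(values):
    """Rank values from 1 upwards (assumes no ties, for simplicity)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


times_days = [1, 2, 100]

# Ranking discards the size of the gaps between observations
print(ranks(times_days))      # [1, 2, 3]
print(ranks([1, 2, 3]))       # [1, 2, 3], indistinguishable from the above

# A log transformation reduces skewness but keeps the relative spacing
print([round(log(t), 2) for t in times_days])  # [0.0, 0.69, 4.61]
```

A t-test applied to the log-transformed values compares geometric rather than arithmetic means, so results are naturally reported as ratios after back-transformation.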
Problems of multiple comparisons, or multiple testing, occur when considering the outcomes of more than one statistical inference simultaneously. In the context of this survey, it is best illustrated by considering a number of reported statistical tests for one study all reporting evidence for significance at the 5% level. By definition, if one undertakes 20 hypothesis tests on data where we know that there is no true difference, we will expect to see one significant result at the 5% level by chance alone. Therefore, if we undertake multiple tests, we require a stronger level of evidence to compensate for this. For example, the Bonferroni correction preserves the ‘familywise error rate’ (α), or the probability of making one or more false discoveries, by requiring that each of n tests should be conducted at the α/n level of significance, i.e. it adjusts the significance level to account for multiple comparisons [15].
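The arithmetic behind these statements can be sketched as follows. The familywise error rate calculation assumes that the tests are independent, which is a simplification, and the function names and example p-values are hypothetical.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of significant results when every null hypothesis is true."""
    return alpha * n_tests


def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant at the Bonferroni-adjusted level alpha / n."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


print(expected_false_positives(0.05, 20))         # 1.0
print(round(familywise_error_rate(0.05, 20), 3))  # 0.642
# With three tests the adjusted threshold is 0.05 / 3, roughly 0.0167
print(bonferroni_significant([0.001, 0.02, 0.04]))  # [True, False, False]
```

The Bonferroni correction is simple but conservative, particularly when the tests are positively correlated, so it should be seen as one option among several rather than a default.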
The questionnaire recorded the number of hypotheses tested in each paper, based on an approximate count of the number of p-values reported. Three papers did not report p-values, 31 papers (31%; 95% CI 22–41%) reported fewer than 5 p-values, 36 papers (36%; 95% CI 27–46%) reported between 5 and 20 p-values and 30 papers (30%; 95% CI 21–40%) reported more than 20 p-values. The relevance of, and need for, formal adjustment for multiple comparisons will clearly be very problem specific [16]. Whilst most statisticians would concede that the formal adjustment of p-values to account for multiple comparisons may not necessarily be required when reporting a small number of hypothesis tests, if reporting more than 20 p-values from separate analyses, some discussion of the rationale and need for so many statistical tests should be provided and formal adjustment for multiple comparisons considered. In an extreme case, one paper reported a total of 156 p-values without considering the effect of this on inferences from the study. A Bonferroni correction to the significance level would have resulted in at least 21 of the 35 reported significant p-values in this study being regarded as no longer significant. Where some adjustment was made for multiple comparisons (7 papers), the Bonferroni correction was the most common method (5 papers). One other paper used Tukey’s Honestly Significant Difference (HSD) test and another set the significance level to 1% (rather than 5%) in an ad-hoc manner to account for undertaking 10 tests.
Presentation of results
The clear and concise presentation of results, be it the labelling of tables and graphs or the terminology used to describe a method of analysis or a p-value, is an important component of all research papers. The statistical assessment of the study papers identified two important presentational issues.
The statistical assessors were asked to comment on the quality of the data presentation in the papers which included tables and graphs. Graphs and tables were clearly titled in only 29% (21–39%) of papers. For instance, typical examples of uninformative labels included “Table I: Details of Study” and “Table II: Surgical Information”. Furthermore, only 43% of graphs and tables were considered to be clearly labelled. In particular, a number of the papers included tables with data in parentheses without further explanation. The reader was then left to decide whether the numbers indicated, for example, 95% confidence intervals, inter-quartile ranges (IQRs) or ranges. Some tables also included p-values with no indication of the statistical test used. The description of graphical displays was occasionally confusing. One paper stated that the bars of a box-and-whisker plot represented the maximum and minimum values in a dataset, when there were clearly points outside the bars. By convention, the bars (whiskers) extend to the most extreme data points lying within 1.5 times the inter-quartile range of the edges of the box, with points outside the bars identified as ‘outliers’. Interestingly, another paper claimed that the boxes showed standard deviations, rather than the correct IQR, so there is clearly a wider misunderstanding of these figures.
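The box-and-whisker convention referred to above can be made concrete with a short sketch. It assumes the common Tukey convention, in which the whiskers end at the most extreme observations lying within 1.5 times the IQR of the quartiles; quartile definitions differ slightly between software packages, and the function name and data are hypothetical.

```python
from statistics import quantiles


def box_plot_summary(values, k=1.5):
    """Quartiles, whisker ends and outliers under the 1.5 x IQR convention."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),   # most extreme point inside the lower fence
        "whisker_high": max(inside),  # most extreme point inside the upper fence
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 50])
print(summary["whisker_low"], summary["whisker_high"], summary["outliers"])
```

In this example the upper whisker stops at 8 and the value 50 is plotted as a separate point, so the whiskers do not show the minimum and maximum of the dataset.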
Raw data for individual patients (or experimental units) were displayed graphically or in tables in only 9% (4–17%) of papers. Raw data, as opposed to means, medians or other statistics, always provide the simplest and clearest summary of a study, and direct access to the data for the interested reader. Although we accept that there may be practical reasons why authors would not want to present such data, it is disappointing that such a small proportion of investigators decided to do so.
The lack of appropriate statistical review, either prior to submission or at the review stage, was apparent in the catalogue of simple statistical reporting errors found in these papers. For instance, methods were reported that, to our knowledge, do not exist: e.g. "multiple variance analysis" or the "least squares difference" post-hoc test after ANOVA. Presumably the latter refers to a least significant difference test, but the former is ambiguous. Another class of reporting errors comprised statements that simply made no statistical sense in the context in which they were reported: e.g. "…there was no difference in the incidence among the corresponding groups (chi-squared test, p = 0.05)…", and "…there were no significant differences in the mean T-score or Z-score between the patients and controls…". The former remark was made in the context of rejecting the null hypothesis at the 5% level for significance and the latter presumably implied that mean t-statistics and z-scores were compared between groups, which makes no statistical sense. The inadequate or poor reporting of p-values was also widespread, and typical errors included "p < 0.000009", "p < 0.134" and, more generally, the use of "p = NS" or "p < 0.05". P-values should generally be quoted to no more than 3 decimal places, and be exact (as opposed to an inequality, e.g. p < 0.05), unless they are very small, in which case p < 0.001 is acceptable.
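The reporting rule for p-values stated above is simple enough to encode directly. The helper below is a hypothetical illustration of that rule, not a function from any statistical package.

```python
def format_p(p):
    """Format a p-value as recommended above: exact to 3 decimal places,
    or 'p < 0.001' when very small."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "p < 0.001"
    return f"p = {p:.3f}"


print(format_p(0.000009))  # p < 0.001
print(format_p(0.134))     # p = 0.134
print(format_p(0.03))      # p = 0.030
```

Applying such a rule consistently avoids both spurious precision and uninformative statements such as "p = NS".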