Background
Methods
Background
Evaluation of the TMF guideline for data quality
Computing data quality with R
Application example
Results
Structure of the data quality framework
Name (dimension / domain) | Definition | Primary reference objects to detect data quality issues | Primary reporting metrics of indicators |
---|---|---|---|
Integrity | The degree to which the data conforms to structural and technical requirements. | ||
Structural data set error | The observed structure of a data set differs from the expected structure. | Data elements, data records | N |
Relational data set error | The observed correspondence between different data sets differs from the expected correspondence. | Data sets | N |
Value format error | The technical representation of data values within a data set does not conform to the expected representation. | Data fields | N, % |
Completeness | The degree to which expected data values are present. | ||
Crude missingness | Metrics of missing data values that ignore the underlying reasons for missing data. | Data fields | N, % |
Qualified missingness | Metrics of missing data values that use reasons underlying missing data. | Data fields, data elements, data records | N, % |
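Consistency | The degree to which observed data values comply with admissible values and value ranges and are free of contradictions. | ||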
Range and value violations | Observed data values do not comply with admissible data values or value ranges. | Data fields | N, % |
Contradictions | Observed data values appear in impossible or improbable combinations. | Data fields | N, % |
Accuracy | The degree of agreement between observed and expected distributions and associations. | ||
Unexpected distributions | Observed distributional characteristics differ from expected distributional characteristics. | Data elements, data records | Diverse statistical measures |
Unexpected associations | Observed associations differ from expected associations. | Data elements, data records | Diverse statistical measures |
Disagreement of repeated measurements | Disagreement between repeated measurements of the same or similar objects under specified conditions. | Data elements, data records | Diverse statistical measures |
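
The distinction between crude and qualified missingness, and the N and % reporting metrics, can be illustrated with a minimal sketch. It is written in plain Python and is not the interface of the R functions discussed in this paper. The missing codes (99980, 99981), their labels, and the function name `missingness_summary` are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical missing codes; a real study defines its own code list.
MISSING_CODES = {99980: "refusal", 99981: "not assessable"}


def missingness_summary(values, missing_codes=MISSING_CODES):
    """Return crude and qualified missingness (N and %) for one data element."""
    n_total = len(values)

    def pct(n):
        return round(100 * n / n_total, 1) if n_total else 0.0

    # System-indicated missing values carry no reason (None in this sketch).
    n_system = sum(v is None for v in values)
    # Coded missing values carry a reason and allow qualified metrics.
    reasons = Counter(missing_codes[v] for v in values if v in missing_codes)
    # Crude missingness ignores the reason: every missing field counts.
    n_crude = n_system + sum(reasons.values())

    return {
        "crude": {"N": n_crude, "%": pct(n_crude)},
        "qualified": {r: {"N": n, "%": pct(n)} for r, n in reasons.items()},
        "unqualified": {"N": n_system, "%": pct(n_system)},
    }


# Ten data fields of a single data element, e.g. a blood pressure reading.
values = [120, None, 99980, 135, 99981, 99980, 128, None, 140, 131]
summary = missingness_summary(values)
print(summary)
```

The "unqualified" entry counts system-indicated missing values that lack a qualified missing code.
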
ID | Name of indicator | Definition |
---|---|---|
Integrity | ||
DQI-1001 | Unexpected data elements | The observed set of available data elements does not match the expected set. |
DQI-1002 | Unexpected data records | The observed set of available data records does not match the expected set. |
DQI-1003 | Duplicates | The same data elements or data records appear multiple times. |
DQI-1004 | Data record mismatch | Data records from different data sets do not match as expected. |
DQI-1005 | Data element mismatch | Data elements from different data sets do not match as expected. |
DQI-1006 | Data type mismatch | The observed data type does not match the expected data type. |
DQI-1007 | Inhomogeneous value formats | The observed data values have inhomogeneous format across different data fields. |
DQI-1008 | Uncertain missingness status | System-indicated missing values (e.g. NA/./Null …) appear where a qualified missing code is expected. |
Completeness | ||
DQI-2001 | Missing values | Data fields without a measurement value. |
DQI-2002 | Non-response rate | The proportion of eligible observational units for which no information could be obtained. |
DQI-2003 | Refusal rate | The proportion of eligible individuals who refuse to give the information sought. |
DQI-2004 | Drop-out rate | The proportion of all participants who only partially complete the study and prematurely abandon it. |
DQI-2005 | Missing due to specified reason | Information in a data collection that is missing due to a specified reason. |
Consistency | ||
DQI-3001 | Inadmissible numerical values | Observed numerical data values are not admissible according to the allowed ranges. |
DQI-3002 | Inadmissible time-date values | Observed time-date values are not admissible according to the allowed time and date ranges. |
DQI-3003 | Inadmissible categorical values | Observed categorical data values are not admissible according to the allowed categories. |
DQI-3004 | Inadmissible standardized vocabulary | Data values are not admissible according to the reference vocabulary. |
DQI-3005 | Inadmissible precision | The precision of observed numerical data values does not match the expected precision. |
DQI-3006 | Uncertain numerical values | Observed numerical values are uncertain or improbable because they are outside the expected ranges. |
DQI-3007 | Uncertain time-date values | Observed time-date values are uncertain or improbable because they are outside the expected ranges. |
DQI-3008 | Logical contradictions | Different data values appear in logically impossible combinations. |
DQI-3009 | Empirical contradictions | Different data values appear in combinations deemed impossible based on empirical reasoning. |
Accuracy | ||
DQI-4001 | Univariate outliers | Numerical data values deviate markedly from others in a univariate analysis. |
DQI-4002 | Multivariate outliers | Numerical data values deviate markedly from others in a multivariate analysis. |
DQI-4003 | Unexpected locations | Observed location parameters differ from expected location parameters. |
DQI-4004 | Unexpected shape | The observed shape of a distribution differs from the expected shape. |
DQI-4005 | Unexpected scale | Observed scale parameters differ from expected scale parameters. |
DQI-4006 | Unexpected proportions | Observed proportions differ from expected proportions. |
DQI-4007 | Unexpected association strength | The observed strength of an association deviates from the expected strength of the association. |
DQI-4008 | Unexpected association direction | The observed direction of an association (e.g. negative, positive) deviates from the expected direction. |
DQI-4009 | Unexpected association form | The observed form of an association (e.g. linear, quadratic, exponential...) deviates from the expected form. |
DQI-4010 | Inter-Class reliability | Differences between classes (e.g. examiners) when measuring the same or similar objects under specified conditions. |
DQI-4011 | Intra-Class reliability | Differences within classes (e.g. examiners) when measuring the same or similar objects under specified conditions. |
DQI-4012 | Disagreement with gold standard | Differences with a gold standard when measuring the same or similar objects under specified conditions. |
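
Two of the integrity indicators above reduce to simple set and count operations, which the following minimal Python sketch illustrates for DQI-1001 (unexpected data elements) and DQI-1003 (duplicates). The function names, variable names, and example identifiers are hypothetical and chosen for illustration only; the sketch does not reproduce any function of the R package.

```python
from collections import Counter


def unexpected_data_elements(observed, expected):
    """Compare the observed set of data elements with the expected set."""
    observed, expected = set(observed), set(expected)
    return {
        # Expected according to the metadata, but absent from the data.
        "missing": sorted(expected - observed),
        # Present in the data, but not described in the metadata.
        "unexpected": sorted(observed - expected),
    }


def duplicate_records(record_ids):
    """Return every record identifier that appears more than once."""
    counts = Counter(record_ids)
    return {rid: n for rid, n in counts.items() if n > 1}


expected_elements = ["id", "age", "sex", "sbp"]
observed_elements = ["id", "age", "sbp", "dbp"]
element_check = unexpected_data_elements(observed_elements, expected_elements)

duplicates = duplicate_records(["p01", "p02", "p02", "p03", "p02"])

print(element_check)
print(duplicates)
```

Both results are counts of objects, in line with the reporting metric N given for the integrity domains.
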
Integrity
Completeness
Correctness: consistency and accuracy
Implementations
Descriptors
Data quality and process variables
R-function name | Implementations within the function | Linked with the following indicators |
---|---|---|
pro_applicability_matrix() | Checks the correspondence of study data with the metadata and accessibility to files. Each study data variable is examined regarding the data type and cross-checked with the specified data type in the metadata. | Unexpected data elements; data type mismatch |
com_unit_missingness() | Evaluates on the level of entire observational units whether all measurements are missing. | Missing measurements (Unit level) |
com_segment_missingness() | Evaluates whether all associated measurements at the level of study segments (e.g. single examinations or instruments) are missing for an observational unit. A pattern plot is provided as a descriptor. | Missing measurements (Segment level) |
com_item_missingness() | Examines for each variable of the study data the amount and type of missing data according to specified missing/jump codes, including a count of data fields without any data entry like NA in R. | Missing measurements (Item level); specific missingness; uncertain missingness status |
con_limit_deviations() | Assesses limit deviations with regard to inadmissible and improbable values and counts deviations above and below the specified thresholds. Limits may comprise hard limits to identify inadmissible values, soft limits to identify improbable values, and detection limits, which refer to censoring based on the properties of the measurement devices used. | Inadmissible numerical values; inadmissible time-date values; uncertain numerical values; uncertain time-date values |
con_inadmissible_categorical() | Compares the match of single data values with admissible categories, summarizes observed vs. expected data values and counts the violations. | Inadmissible categorical values |
con_contradictions() | Compares two data values of the same observational unit by using one of 16 logical comparisons. Counts the number of contradictions. | Logical contradictions; empirical contradictions |
acc_distributions() | Creates distributional plots (bar charts or histograms) for numerical measurements (float, integer). If a grouping variable is provided, stratified empirical cumulative distribution functions (ECDFs) are plotted as well [20]. | Indicators within the unexpected distributions domain |
acc_univariate_outlier() | | Univariate outliers |
acc_multivariate_outlier() | | Multivariate outliers |
acc_shape_or_scale() | | Unexpected shape parameter; unexpected scale parameter |
acc_end_digits() | | Unexpected shape |
acc_margins() | Compares the marginal distribution of different classes (e.g. examiners, devices) using measurements adjusted for covariates (e.g. age, sex). Adjusted linear models, logistic regression, or Poisson regression are used to model marginal means of continuous measurements, binary data, and count data [48]. | Unexpected location; unexpected proportion |
acc_varcomp() | | Unexpected location |
acc_loess() | Computes and displays, as a descriptor, LOESS-smoothed trends of measurements across different classes over time. The raw measurements can be adjusted for covariates such as age or sex; the resulting residuals are then smoothed over time using LOESS [42]. | Indicators within the unexpected distributions domain, foremost unexpected location; unexpected proportion |
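
The logic described for limit deviations (hard limits for inadmissible values, soft limits for improbable values, counts above and below each threshold) can be sketched as follows. This is a plain Python illustration under stated assumptions and not the interface of `con_limit_deviations()`: the function name `limit_deviations`, its parameters, and the example limits for systolic blood pressure are all hypothetical. Detection limits are omitted from the sketch.

```python
def limit_deviations(values, hard=(None, None), soft=(None, None)):
    """Count values below/above hard limits (inadmissible) and soft limits (improbable).

    Each limit pair is (lower, upper); None disables that bound.
    A value violating a hard limit is not counted again as a soft-limit deviation.
    """
    counts = {"below_hard": 0, "above_hard": 0, "below_soft": 0, "above_soft": 0}
    for v in values:
        if v is None:
            # Missing values are a completeness issue, not a limit deviation.
            continue
        if hard[0] is not None and v < hard[0]:
            counts["below_hard"] += 1
        elif hard[1] is not None and v > hard[1]:
            counts["above_hard"] += 1
        elif soft[0] is not None and v < soft[0]:
            counts["below_soft"] += 1
        elif soft[1] is not None and v > soft[1]:
            counts["above_soft"] += 1
    return counts


# Hypothetical systolic blood pressure readings in mmHg.
sbp = [118, 45, 260, 310, 95, None, 75, 182]
result = limit_deviations(sbp, hard=(40, 300), soft=(80, 200))
print(result)
```

Checking hard limits first keeps the two categories disjoint, so each value contributes to at most one count.
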
Using R and the data quality workflow
Discussion
TMF ID | TMF name | Related concept in the current framework | Description of element type/implementation in the current framework |
---|---|---|---|
TMF-1001 | Agreement with previous values | Disagreement of repeated measurements | Domain |
TMF-1003 | Consistency | Contradictions | Domain |
TMF-1004 | Certain contradiction/error | Certain contradictions | Indicator |
TMF-1005 | Possible contradiction/warning | Uncertain contradictions | Indicator |
TMF-1006; TMF-1009; TMF-1010; TMF-1011; TMF-1052 | Distribution of values; Distribution of parameters recorded by the investigator; Distribution of parameters recorded by the device; Distribution of findings recorded by a medical reader; Distribution of parameters between study sites | Unexpected location parameter; Unexpected shape parameter; Unexpected scale parameter; Unexpected proportion | Indicator, but TMF differentiates by the influencing factor while the current framework distinguishes by the statistical aspect. |
TMF-1012 | Missing modules | Unexpected data elements | An implementation that identifies missing modules within the indicator unexpected data elements |
TMF-1013 | Missing values in data elements | Missing values | Indicator |
TMF-1014 | Missing values in mandatory data elements | Missing values | An implementation that identifies mandatory data elements within the indicator missing values |
TMF-1016 | Data elements with value unknown etc. | Missing due to specified reason | Indicator (TMF targets a specific reason for missing value: unknown values) |
TMF-1018 | Outliers (continuous data elements) | Univariate outliers | Indicator |
TMF-1019 | Values that exceed the measurability limits | Inadmissible numerical values | Implementation within inadmissible numerical values |
TMF-1021 | Illegal values of qualitative data elements | Inadmissible categorical values | Indicator |
TMF-1022 | Illegal values of qualitative data elements used for the coding of missings | Inadmissible categorical values | An implementation that identifies inadmissible coding of missing values within the indicator inadmissible categorical values |
TMF-1023 | Illegal values used for the coding of missing modules | Inadmissible categorical values | An implementation that identifies inadmissible coding of missing modules within the indicator inadmissible categorical values |
TMF-1024 | Illegal values of qualitative data elements used for the coding of results exceeding measurability limits | Inadmissible categorical values | An implementation that identifies data elements with codes related to measurability limits within the indicator inadmissible categorical values |
TMF-1029 | Duplicates | Duplicates | Indicator |
TMF-1030 | Recruitment rate | Non-response rate | Indicator; the current framework uses the inverse. The link between the two depends on how recruitment and non-response rates are defined. |
TMF-1031; TMF-1032 | Refusal rate of investigations; Refusal rate of modules | Refusal rate | Indicator with implementations at the level of examination modules or the entire study |
TMF-1034 | Drop-out-rate | Drop-out rate | Indicator |
TMF-1042 | Observational units with follow-up | Non-response rate (inverse at unit level, depending on implementation form) | Indicator |
TMF-1043 | Accuracy | Accuracy | Dimension |
TMF-1046 | Completeness | Completeness | Dimension |