Background
Objective and specific aims
Specific aims
- Describe the methods used by SWIFT-Review to conduct topic modeling, categorization of studies, and priority ranking for relevance.
- Present performance benchmarks for priority ranking based on a comparison of SWIFT-Review to manual review for 20 data sets of various size and complexity. Fifteen of the 20 data sets are public datasets that have been used to evaluate the performance of other text-mining software tools [6].
- Present an example of how SWIFT-Review can be used to prepare a scoping report on an example topic (endocrine-disrupting chemicals; EDCs) selected because of the large size of its literature base and for its complexity in terms of number of chemicals, range of health effects, and types of evidence (human, animal, in vitro).
Methods
Document import and search
Bag-of-words model to characterize document features
Topic modeling
Document prioritization
Log-linear model
Assessing document prioritization performance
Datasets
Data set | Source | Database (inputs) | Records from search | Included | Excluded | Comments |
---|---|---|---|---|---|---|
PFOA/PFOS and immunotoxicity | NIEHS | PubMed (PMIDs) | 6331 | 95 (1.5 %) | 6236 (98.5 %) | Targeted topic |
Bisphenol A (BPA) and obesity | NIEHS | PubMed (PMIDs) | 7700 | 111 (1.4 %) | 7589 (98.6 %) | Targeted topic |
Transgenerational inheritance of health effects | NIEHS | PubMed (PMIDs) | 48,638 | 765 (1.6 %) | 47,873 (98.4 %) | Untargeted topic |
Fluoride and neurotoxicity in animal models | NIEHS | Multiple (titles + abstracts) | 4479 | 51 (1.1 %) | 4428 (98.9 %) | Targeted topic |
Neuropathic pain | CAMARADES | Multiple (titles + abstracts) | 29,207 | 5011 (17.2 %) | 24,196 (82.8 %) | Semi-targeted topic |
Skeletal muscle relaxants | [6] | PubMed (PMIDs) | 1643 | 9 (0.6 %) | 1634 (99.4 %) | Public dataset |
Opioids | [6] | PubMed (PMIDs) | 1915 | 15 (0.8 %) | 1900 (99.2 %) | Public dataset |
Antihistamines | [6] | PubMed (PMIDs) | 310 | 16 (5.2 %) | 294 (94.8 %) | Public dataset |
ADHD | [6] | PubMed (PMIDs) | 851 | 20 (2.4 %) | 831 (97.6 %) | Public dataset |
Triptans | [6] | PubMed (PMIDs) | 671 | 24 (3.6 %) | 647 (96.4 %) | Public dataset |
Urinary Incontinence | [6] | PubMed (PMIDs) | 327 | 40 (12.2 %) | 287 (87.8 %) | Public dataset |
Ace Inhibitors | [6] | PubMed (PMIDs) | 2544 | 41 (1.6 %) | 2503 (98.4 %) | Public dataset |
Nonsteroidal anti-inflammatory | [6] | PubMed (PMIDs) | 393 | 41 (10.4 %) | 352 (89.6 %) | Public dataset |
Beta blockers | [6] | PubMed (PMIDs) | 2072 | 42 (2.0 %) | 2030 (98.0 %) | Public dataset |
Proton pump inhibitors | [6] | PubMed (PMIDs) | 1333 | 51 (3.8 %) | 1282 (96.2 %) | Public dataset |
Estrogens | [6] | PubMed (PMIDs) | 368 | 80 (21.7 %) | 288 (78.3 %) | Public dataset |
Statins | [6] | PubMed (PMIDs) | 3465 | 85 (2.5 %) | 3380 (97.5 %) | Public dataset |
Calcium-channel blockers | [6] | PubMed (PMIDs) | 1218 | 100 (8.2 %) | 1118 (91.8 %) | Public dataset |
Oral hypoglycemics | [6] | PubMed (PMIDs) | 503 | 136 (27.0 %) | 367 (73.0 %) | Public dataset |
Atypical antipsychotics | [6] | PubMed (PMIDs) | 1120 | 146 (13.0 %) | 974 (87.0 %) | Public dataset |
Performance metrics
Test procedure
Document tagging for problem formulation
Evidence stream
Health outcomes
Word | Type | TF | DF | TF_IDF score |
---|---|---|---|---|
Pulmonari | Title | 708 | 1777 | 0.372008 |
Lung | Title | 715 | 1623 | 0.362747 |
Lung neoplasms | MESH | 746 | 2003 | 0.269274 |
Lung | Abstract | 2459 | 3131 | 0.266611 |
Tuberculosis, pulmonary | MESH | 324 | 940 | 0.211322 |
Lung cancer | Title 2-gram | 241 | 474 | 0.209402 |
Pulmonari | Abstract | 1564 | 2571 | 0.204265 |
Asthma | Title | 281 | 889 | 0.1953 |
Asthma | MESH | 486 | 1675 | 0.193189 |
Respiratori | Title | 292 | 1024 | 0.175736 |
Lung diseases | MESH | 304 | 910 | 0.16359 |
Asthma | Abstract | 1045 | 1163 | 0.158508 |
Tuberculosi | Title | 233 | 1053 | 0.153724 |
Lung cancer | Abstract 2-gram | 588 | 689 | 0.145245 |
Pneumonia | Title | 178 | 547 | 0.139709 |
Bronchial | Title | 139 | 327 | 0.13054 |
Pulmonari tuberculosi | Title 2-gram | 86 | 236 | 0.112054 |
Small cell lung | Title 3-gram | 92 | 174 | 0.110415 |
Cell lung cancer | Title 3-gram | 87 | 161 | 0.10728 |
Pulmonari diseas | Title 2-gram | 74 | 146 | 0.105468 |
Pulmonari hypertens | Title 2-gram | 67 | 118 | 0.100461 |
Chronic obstruct | Title 2-gram | 71 | 124 | 0.098575 |
Chronic obstruct pulmonari | Title 3-gram | 56 | 99 | 0.095042 |
Obstruct pulmonari diseas | Title 3-gram | 55 | 99 | 0.093811 |
Pulmonari embol | Title 2-gram | 52 | 94 | 0.090121 |
Chemical exposure or treatment
- Excluded all names of type “DisplayFormula” (i.e., chemical formulas like “H2O”).
- Obtained a set of 109,582 English words from SIL International Linguistics [17]. Any chemical terms that appeared in this list and were not the exact name of a Tox21 chemical (i.e., a synonym and not the original name) were removed. This removed ambiguous terms like “stuff” and “impact” but not “ethanol” or “toluene.”
- Removed all terms with fewer than five letters (most of the ambiguous abbreviations).
- Removed non-English chemical names.
- Removed inverted chemical names.
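The filtering steps listed above can be illustrated with a minimal Python sketch. It is not the SWIFT-Review implementation: the `ChemTerm` record, its field names, the name-type labels other than “DisplayFormula”, the `language` flag, and the comma-based test for inverted names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChemTerm:
    """Hypothetical record for one chemical name or synonym."""
    text: str
    name_type: str        # only "DisplayFormula" comes from the text; other labels are assumed
    is_primary: bool      # True if this is the chemical's original Tox21 name
    language: str = "en"  # assumed metadata flag


def looks_inverted(text):
    # Assumed heuristic: inverted names put the head word first, e.g. "Acid, acetic".
    # Locants such as "1,2-dichloroethane" have no space after the comma and pass.
    return ", " in text


def keep_term(term, english_words):
    if term.name_type == "DisplayFormula":
        return False  # step 1: drop chemical formulas
    if term.text.lower() in english_words and not term.is_primary:
        return False  # step 2: drop synonyms that are ordinary English words
    if sum(ch.isalpha() for ch in term.text) < 5:
        return False  # step 3: drop terms with fewer than five letters
    if term.language != "en":
        return False  # step 4: drop non-English names
    if looks_inverted(term.text):
        return False  # step 5: drop inverted names
    return True


def filter_terms(terms, english_words):
    words = {w.lower() for w in english_words}
    return [t for t in terms if keep_term(t, words)]


example_terms = [
    ChemTerm("H2O", "DisplayFormula", False),
    ChemTerm("impact", "Synonym", False),
    ChemTerm("ethanol", "PrimaryName", True),
    ChemTerm("BPA", "Synonym", False),
    ChemTerm("Acid, acetic", "Synonym", False),
    ChemTerm("Toluol", "Synonym", False, language="de"),
    ChemTerm("Bisphenol A", "PrimaryName", True),
]
kept = filter_terms(example_terms, {"impact", "stuff", "ethanol"})
print([t.text for t in kept])
```

In this toy input, “ethanol” survives the English-word check only because it is flagged as a primary name, mirroring the “ethanol”/“toluene” exception described in the list.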
Dataset used to assess document tagging and annotation features: Endocrine-disrupting chemicals
Results
Performance of prioritization algorithm
Data set | Cohen (2006) [6] | Matwin (2010) [28] | | SWIFT-Review (25 trials) |
---|---|---|---|---|
WSS@95 [proportion of studies screened to achieve 95 % recall] | | | | |
PFOA/PFOS and immunotoxicity | N/A | N/A | N/A | 0.805 [0.145] |
Bisphenol A (BPA) and obesity | N/A | N/A | N/A | 0.752 [0.198] |
Transgenerational inheritance of health effects | N/A | N/A | N/A | 0.714 [0.236] |
Fluoride and neurotoxicity in animal models | N/A | N/A | N/A | 0.870 [0.080] |
Neuropathic pain | N/A | N/A | N/A | 0.691 [0.259] |
SWIFT-Review mean | | | | 0.766 [0.184] |
Skeletal muscle relaxants | 0.000 [0.950] | 0.265 [0.685] | 0.374 [0.576] | 0.556 [0.394] |
Opioids | 0.133 [0.817] | 0.554 [0.396] | 0.364 [0.586] | 0.826 [0.124] |
Antihistamines | 0.000 [0.950] | 0.149 [0.801] | 0.236 [0.714] | 0.137 [0.813] |
ADHD | 0.680 [0.270] | 0.622 [0.328] | 0.526 [0.424] | 0.793 [0.157] |
Triptans | 0.034 [0.916] | 0.274 [0.676] | 0.346 [0.604] | 0.412 [0.538] |
Urinary incontinence | 0.261 [0.689] | 0.296 [0.654] | 0.432 [0.518] | 0.530 [0.420] |
Ace inhibitors | 0.566 [0.384] | 0.523 [0.427] | 0.733 [0.217] | 0.801 [0.149] |
Nonsteroidal anti-inflammatory | 0.497 [0.453] | 0.528 [0.422] | 0.672 [0.278] | 0.730 [0.220] |
Beta blockers | 0.284 [0.666] | 0.367 [0.583] | 0.465 [0.485] | 0.428 [0.522] |
Proton pump inhibitors | 0.277 [0.673] | 0.229 [0.721] | 0.328 [0.622] | 0.378 [0.572] |
Estrogens | 0.183 [0.767] | 0.375 [0.575] | 0.414 [0.536] | 0.471 [0.479] |
Statins | 0.247 [0.703] | 0.315 [0.635] | 0.491 [0.459] | 0.436 [0.514] |
Calcium-channel blockers | 0.122 [0.828] | 0.234 [0.716] | 0.430 [0.520] | 0.448 [0.502] |
Oral hypoglycemics | 0.090 [0.860] | 0.085 [0.865] | 0.136 [0.814] | 0.117 [0.833] |
Atypical antipsychotics | 0.141 [0.809] | 0.206 [0.744] | 0.170 [0.780] | 0.251 [0.699] |
Mean (Cohen benchmark) | 0.234 [0.716] | 0.335 [0.615] | 0.408 [0.542] | 0.488 [0.462] |
SWIFT-Review grand mean | | | | 0.540 [0.410] |
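Each entry in the table pairs a WSS@95 value with the bracketed proportion of studies screened to reach 95 % recall, and the two sum to 0.95. A minimal Python sketch of that relationship follows. It uses the standard work-saved-over-sampling definition at a fixed target recall; the function name, the input format (a ranked list of 0/1 relevance labels), and the toy ranking are illustrative assumptions, not the SWIFT-Review test procedure.

```python
import math


def wss_at_recall(ranked_labels, recall=0.95):
    """Return (WSS, fraction_screened) for a ranked list of 0/1 relevance labels.

    ranked_labels[0] is the document the ranker placed first.
    WSS@R = R - (fraction of documents screened to reach recall R).
    """
    n = len(ranked_labels)
    total_relevant = sum(ranked_labels)
    if n == 0 or total_relevant == 0:
        raise ValueError("need at least one document and one relevant document")

    # Number of relevant documents that must be found to reach the target recall.
    needed = math.ceil(recall * total_relevant)

    found = 0
    screened = 0
    for label in ranked_labels:
        screened += 1
        found += label
        if found >= needed:
            break

    fraction_screened = screened / n
    return recall - fraction_screened, fraction_screened


# Toy ranking: 100 documents, 10 relevant; nine are ranked first and the
# tenth appears at position 40, so 40 % must be screened for 95 % recall.
toy_ranking = [1] * 9 + [0] * 30 + [1] + [0] * 60
wss, screened_fraction = wss_at_recall(toy_ranking)
print(round(wss, 3), round(screened_fraction, 3))
```

Read against the table, a row such as 0.805 [0.145] means that screening 14.5 % of the ranked records was enough to reach 95 % recall, a saving of 80.5 % relative to screening in random order.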
EDC case study: use of SWIFT-Review document tagging and annotation
Discussion
Document prioritization
Document tagging
Limitations and future developments
Conclusions
Acknowledgements
Funding
Availability of data and materials
- Project Name: SWIFT-Review
- Project Home Page: http://swift.sciome.com/
- Operating System: Platform Independent
- Programming Language: Java
- Other Requirements: at least 8 GB RAM
- License: The software is free for public use. Installation instructions and licensing details are available at the project home page.