Product risk management involves the critical assessment of health product safety issues by evaluating their risk-benefit ratio, followed where necessary by appropriate regulatory action to mitigate these safety concerns [1]. Environmental scanning for safety information on health products is usually conducted across several sources to ensure a thorough search and extensive coverage. First, local trends in the incidence of adverse events can be obtained from spontaneous adverse event reporting. These adverse events are observed during clinical practice and reported to the regulatory agency by healthcare professionals or pharmaceutical manufacturers. Information on product safety may also be obtained from safety alerts disseminated by regulatory authorities such as the Therapeutic Goods Administration in Australia, the Food and Drug Administration in the United States of America and the European Medicines Agency in Europe. Regulatory authorities around the world work closely together and alert one another to any safety concerns raised. Another source of safety information is the primary literature, where results of clinical trials, case reports and other safety-related studies are reported.
Primary literature remains a valuable source of drug safety information, especially for newer drugs with which there is little regulatory experience. With intensive research in the medical sciences, primary literature is generated at a very rapid rate. In PubMed alone, more than 600,000 articles are added each year from over 5000 journals [2]. Although a voluminous amount of information is available, only a small portion is useful for risk assessment. For instance, a search in PubMed for tumour necrosis factor-alpha (TNF-α) blockers using the search terms 'adalimumab', 'infliximab' and 'etanercept' in Medical Subject Headings (MeSH) [3] retrieved close to 4000 articles. However, only about 700 (17.5%) contained valuable information for risk assessment work. It is therefore time-consuming and inefficient to sift manually through this large number of articles to identify those that are valuable for product risk assessment. The ability to expedite the identification of useful literature can thus contribute to the efficiency of risk assessment.
Text mining as a potential solution
Text mining is defined as the process of retrieving or extracting small nuggets of relevant information from large collections of textual data [4]. It is a powerful tool for identifying word usage patterns and has already been deployed effectively in many areas such as email classification [5, 6], legal/business applications [7, 8] and biomedical text analysis [9, 10]. In one study, Agarwal and Yu showed that text mining achieved 91.95% accuracy in automatically classifying sentences from biomedical full text into introduction, methods, discussion and conclusion categories [11]. Wang et al. showed that text mining achieved 95% sensitivity and specificity in 51.5% of abstracts that were automatically classified for the purpose of the Immune Epitope Database [12]. Recently, text mining was used to improve systematic reviews of adverse drug reactions by identifying such articles from the medical literature with a recall of 70% and a precision of 21% [13]. Hence, text mining has been shown to be a powerful and useful tool in the automated classification of textual data. It is therefore worth exploring text mining as a potential solution for developing an automated system to identify relevant documents for product risk management work.
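The evaluation measures quoted in the studies above (accuracy, sensitivity, specificity, recall and precision) all derive from the four cells of a confusion matrix. As a minimal sketch of how they relate, the Python snippet below computes them from raw counts. The class name `ConfusionCounts`, the function names and the counts themselves are illustrative assumptions and are not data from any of the cited studies; the counts were chosen only so that recall is 70% and precision is 21%, the same values as those reported in [13].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical counts for a binary 'relevant' vs 'not relevant' classifier."""
    tp: int  # relevant articles correctly flagged
    fp: int  # irrelevant articles wrongly flagged
    fn: int  # relevant articles missed
    tn: int  # irrelevant articles correctly ignored


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 instead of raising when a denominator is empty.
    return numerator / denominator if denominator else 0.0


def precision(c: ConfusionCounts) -> float:
    # Share of flagged articles that are truly relevant.
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: ConfusionCounts) -> float:
    # Share of relevant articles that were flagged (same as sensitivity).
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    # Share of irrelevant articles that were correctly ignored.
    return _ratio(c.tn, c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    # Share of all articles that were classified correctly.
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)


# Invented counts for illustration only (not taken from any cited study).
example = ConfusionCounts(tp=21, fp=79, fn=9, tn=891)

if __name__ == "__main__":
    print(f"precision   = {precision(example):.2f}")
    print(f"recall      = {recall(example):.2f}")
    print(f"specificity = {specificity(example):.2f}")
    print(f"accuracy    = {accuracy(example):.2f}")
```

With these invented counts, accuracy is high because irrelevant articles dominate the collection, yet roughly four in five flagged articles would still be discarded on manual review. This is why precision and recall are usually reported alongside accuracy when relevant documents are rare.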
In this work, two automated systems were developed and explored for their usefulness in identifying 'useful' articles from their titles and abstracts (henceforth referred to simply as abstracts) in the PubMed database. The first system used only general terms found in the abstracts as predictors and is thus able to identify 'useful' articles regardless of drug class. However, such a general automated system may not be sufficiently accurate for routine risk assessment work. Thus, a second system, specific to a particular drug class, was developed and tested. During routine work, evaluators manually classify a small number of articles related to the drug class of interest. The second system then learns from the abstracts of these articles and develops a model that is specific to that drug class.
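To make the idea of learning from manually classified abstracts concrete, the sketch below trains a small bag-of-words classifier on a handful of labelled texts. It is an illustration under stated assumptions, not the method used in this work: the multinomial naive Bayes algorithm, the class name `AbstractClassifier`, the `ignored_terms` parameter and the four toy abstracts are all hypothetical. Dropping drug names through `ignored_terms` is only one possible reading of a model built on general terms; keeping them mimics a model specific to one drug class.

```python
import math
import re
from collections import Counter


def tokenize(text, ignored_terms=frozenset()):
    """Lowercase the text, keep alphabetic tokens, drop ignored terms."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in ignored_terms]


class AbstractClassifier:
    """Toy multinomial naive Bayes with add-one smoothing (assumed algorithm)."""

    def __init__(self, ignored_terms=frozenset()):
        self.ignored_terms = frozenset(ignored_terms)
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training abstracts
        self.vocab = set()

    def fit(self, abstracts, labels):
        for text, label in zip(abstracts, labels):
            tokens = tokenize(text, self.ignored_terms)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Tokens never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(text, self.ignored_terms) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denominator = sum(counts.values()) + len(self.vocab)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training abstracts, standing in for manually classified articles.
texts = [
    "Case report of tuberculosis reactivation during infliximab therapy",
    "Serious infections and lymphoma reported in patients receiving etanercept",
    "Pharmacokinetic assay validation for adalimumab in serum samples",
    "Cost analysis of biologic prescribing in outpatient clinics",
]
labels = ["useful", "useful", "not useful", "not useful"]

DRUG_NAMES = frozenset({"adalimumab", "infliximab", "etanercept"})

specific = AbstractClassifier().fit(texts, labels)                         # keeps drug names
general = AbstractClassifier(ignored_terms=DRUG_NAMES).fit(texts, labels)  # drops drug names

if __name__ == "__main__":
    query = "Reactivation of latent tuberculosis reported after adalimumab"
    print(specific.predict(query), general.predict(query))
```

A real system would need far more training abstracts than this toy set, along with held-out data to estimate its accuracy before it is relied on in routine work.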
To develop the two automated systems, a large number of journal articles has to be manually classified. Ideally, articles from many different drug classes should be used, especially for the development of the general automated system. However, it is tedious and impractical to manually classify such a large number of articles. Thus, in this study, only articles on TNF-α blockers were manually classified and used to develop the two automated systems. TNF-α blockers were chosen because they have a relatively small corpus of literature, which makes them suitable for manual classification.
TNF-α blockers are biologics indicated for several autoimmune diseases such as psoriasis, rheumatoid arthritis and Crohn's disease, and are playing an emerging role when these conditions are refractory to conventional therapies [14]. However, intensive post-marketing surveillance [15, 16] and case reports [17–19] have revealed rare but severe adverse effects such as opportunistic infections, reactivation of latent infections, new-onset psoriasis and lymphomas. It is unclear whether these adverse effects are due to predisposition by the underlying diseases or to the TNF-α blockers per se; post-marketing surveillance is therefore of paramount importance in monitoring the safety profile of this group of drugs.