Background
MEDLINE is a bibliographic database that includes metadata and citations of biomedical literature. Although it covers articles from around the world in varied areas, including medicine, pharmacy, and biology, it does not categorize those articles by specific research area. The bibliographic content of MEDLINE is manually indexed with MeSH (Medical Subject Headings) [1], and the content can be searched for a specific topic using MeSH in PubMed. However, because MeSH was originally designed as a controlled vocabulary thesaurus for indexing, cataloguing, and searching articles, it is difficult to apply it to the classification of academic disciplines.
Traditional medicine, particularly in Northeast Asia including traditional Chinese medicine (TCM), has developed since ancient times, and a number of evidence-based articles have been published in this area in recent years. While MEDLINE contains articles on traditional medicine, it does not offer a way to search for them exclusively, which makes it difficult for researchers to analyze research trends in traditional medicine. Traditional medicine articles are often indexed under MeSH headings such as “Medicine, Chinese Traditional”, but many articles lack such a heading. In particular, many traditional medicine studies concern herbal drugs, and these studies are generally indexed under MeSH headings such as “Drugs, Chinese Herbal”. However, because studies on the effects of herbal extracts or on herb genomes are often indexed under “Plant Extracts” and “Genes, Plant”, respectively, MeSH alone is not sufficient to determine whether an article is about traditional medicine. A search for traditional medicine articles therefore has to combine MeSH with additional keywords. Because different keywords return different search results, it remains difficult to search exclusively for traditional medicine articles.
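The retrieval gap described above can be illustrated with a minimal Python sketch. The records, identifiers, titles, and keyword lists are invented for illustration and are not MEDLINE data; `mesh_only` and `mesh_or_keywords` are hypothetical helpers, not part of any PubMed interface.

```python
# Toy records: identifiers, titles, and MeSH assignments are invented for illustration.
RECORDS = [
    {"id": "toy-1", "title": "Acupuncture in traditional Chinese medicine",
     "mesh": ["Medicine, Chinese Traditional"]},
    {"id": "toy-2", "title": "A herbal formula for insomnia",
     "mesh": ["Drugs, Chinese Herbal"]},
    {"id": "toy-3", "title": "Anti-inflammatory effect of ginseng root extract",
     "mesh": ["Plant Extracts"]},
    {"id": "toy-4", "title": "Genome of a medicinal herb, licorice",
     "mesh": ["Genes, Plant"]},
    {"id": "toy-5", "title": "Tomato leaf extract as a pesticide",
     "mesh": ["Plant Extracts"]},
]

# Headings named in the text as typical for traditional medicine articles.
TM_HEADINGS = {"Medicine, Chinese Traditional", "Drugs, Chinese Herbal"}


def mesh_only(records):
    """Keep records that carry at least one traditional medicine heading."""
    return [r["id"] for r in records if TM_HEADINGS & set(r["mesh"])]


def mesh_or_keywords(records, keywords):
    """Keep records matched by heading OR by a keyword found in the title."""
    hits = []
    for r in records:
        title = r["title"].lower()
        if TM_HEADINGS & set(r["mesh"]) or any(k in title for k in keywords):
            hits.append(r["id"])
    return hits


print(mesh_only(RECORDS))                              # misses toy-3 and toy-4
print(mesh_or_keywords(RECORDS, ["ginseng"]))          # recovers toy-3 only
print(mesh_or_keywords(RECORDS, ["herb", "extract"]))  # also pulls in toy-5
```

In this toy setting the heading-only filter misses the extract and genome studies, a narrow keyword recovers only one of them, and broader keywords also retrieve an unrelated plant study.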
Most academic disciplines have journals that mainly publish articles in that discipline. However, not all articles in a discipline are published in such journals, and literature databases such as MEDLINE include many journals that cover various areas. Therefore, it is difficult to discriminate articles on traditional medicine from those of other disciplines by using journal information alone.
To overcome these difficulties, this research devised a classifier that identifies articles on Northeast Asian traditional medicine by using the Support Vector Machine (SVM), which is widely used in text mining. We also constructed a web server called DisArticle, which restricts searches to the traditional medicine articles among all articles in MEDLINE. The major goal of DisArticle was to reduce the workload of researchers by reducing the number of articles they must search through and screen, which can help them analyze research trends in traditional medicine more easily.
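As a rough illustration of an SVM-based binary text classifier of this kind, the sketch below trains a linear SVM by sub-gradient descent on the regularized hinge loss, using only the Python standard library. It is a minimal sketch under stated assumptions, not the DisArticle implementation: the feature design (title tokens plus MeSH headings), the training examples, the hyperparameters, and the omission of a bias term are all assumptions made for illustration.

```python
import random

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the"}


def featurize(text, mesh_terms):
    """Bag-of-words over the text plus one binary feature per MeSH heading."""
    feats = {}
    for tok in text.lower().split():
        if tok not in STOPWORDS:
            key = "w=" + tok
            feats[key] = feats.get(key, 0.0) + 1.0
    for heading in mesh_terms:
        feats["mesh=" + heading] = 1.0
    return feats


def score(w, x):
    """Linear decision value w . x for a sparse feature dict x."""
    return sum(w.get(k, 0.0) * v for k, v in x.items())


def train_linear_svm(examples, epochs=50, eta=0.1, lam=0.01, seed=0):
    """Sub-gradient descent on the L2-regularized hinge loss (no bias term)."""
    rng = random.Random(seed)
    w = {}
    for _ in range(epochs):
        order = list(range(len(examples)))
        rng.shuffle(order)
        for i in order:
            x, y = examples[i]
            violated = y * score(w, x) < 1.0  # inside the margin or misclassified
            for k in w:                       # regularization step: shrink weights
                w[k] *= 1.0 - eta * lam
            if violated:                      # hinge-loss step: move towards y * x
                for k, v in x.items():
                    w[k] = w.get(k, 0.0) + eta * y * v
    return w


def predict(w, x):
    """Return +1 (traditional medicine) or -1 (other)."""
    return 1 if score(w, x) > 0 else -1


# Invented training examples: label +1 = traditional medicine, -1 = other.
TRAIN = [
    (featurize("ginseng extract improves fatigue", ["Plant Extracts"]), 1),
    (featurize("acupuncture for chronic pain", ["Acupuncture Therapy"]), 1),
    (featurize("herbal decoction in traditional chinese medicine",
               ["Drugs, Chinese Herbal", "Medicine, Chinese Traditional"]), 1),
    (featurize("statin therapy lowers cholesterol",
               ["Hydroxymethylglutaryl-CoA Reductase Inhibitors"]), -1),
    (featurize("insulin dosing in type 2 diabetes", ["Insulin"]), -1),
    (featurize("surgical outcomes after knee replacement",
               ["Arthroplasty, Replacement, Knee"]), -1),
]

W = train_linear_svm(TRAIN)
print(predict(W, featurize("ginseng extract and the genome", ["Genes, Plant"])))
print(predict(W, featurize("statin dosing", [])))
```

The toy data are trivially separable, so the sketch shows only the mechanics of training and prediction. A realistic classifier would need a large labeled corpus, held-out evaluation, and tuned hyperparameters.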
Much research has applied machine learning techniques, such as classification, to the MEDLINE database. Classification has mainly been used to discover new knowledge such as protein-protein interactions [2] or gene-disease associations [3]. Such techniques have also been used to extract gene terms [4] or chemical names [5] from the content of articles. Recently, an SVM-based classifier was constructed to determine whether a given article describes a randomized clinical trial (RCT) [6]. MEDLINE not only includes RCT as an article publication type, but also defines what kind of work counts as an RCT (http://www.ncbi.nlm.nih.gov/mesh/68016449). However, because this identification of RCTs is carried out in a simple way, that study proposed a classifier model that identifies RCT articles using only the metadata and MeSH terms of each article.
Discussion
The efficacy of traditional medicines from Northeast Asia, including TCM, has been proven clinically since ancient times. With the development of modern medicine, a number of published studies have provided scientific evidence for the efficacy of traditional medicine. What mainly distinguishes traditional medicine in Northeast Asia from other regional traditional medicines, such as traditional African medicine or Ayurvedic medicine, is the use of herbal drugs [16]. Certain medicinal herbs are grown only in Northeast Asia, and the same herb may produce different medicinal effects depending on the region where it is grown. This research showed that herbal data are an important feature for identifying articles on traditional medicine. Therefore, to provide broader coverage of other traditional medicines from around the world, our classifier should be trained with herbal data from other regions.
Recently, because of the increasing global interest in healthcare, a number of studies on complementary and alternative medicines (CAMs) are being conducted. CAMs are generally understood as any medicinal practice that does not originate from scientific evidence. The CAM category includes TCM and other herbal medicines, as well as non-traditional medicines [17]. To build a web server that identifies articles on CAMs, it is first necessary to define criteria that show whether a given article is about a CAM. This study used the WHO IST and the pharmacopoeias of Korea, China, and Japan as the criteria to identify articles on Northeast Asian traditional medicine. If criteria for identifying CAM articles are defined, it will be possible to establish an article identification system for CAMs similar to our web server.
As stated in the background, the work of Cohen et al. is similar to ours. We both constructed SVM-based binary classifiers for articles on MEDLINE and provided methods to determine whether an article is in a particular research area or of a particular type. Cohen et al. used only metadata and MeSH to identify articles. However, to achieve good performance, identifying articles in the traditional medicine field requires more features such as those that describe medicinal herbs.
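The difference between the two feature sets can be sketched as follows. This is a hypothetical illustration: the source does not specify here how herb features were encoded, so the lexicon lookup, the three-entry lexicon, the feature names, and the example article (including its journal name) are all assumptions.

```python
# Illustrative lexicon only; a real system would draw herb names from
# curated sources such as national pharmacopoeias.
HERB_LEXICON = {"panax ginseng", "glycyrrhiza uralensis", "astragalus membranaceus"}


def metadata_mesh_features(article):
    """Features from metadata and MeSH only: journal name and headings."""
    feats = {"journal=" + article["journal"]: 1.0}
    for heading in article["mesh"]:
        feats["mesh=" + heading] = 1.0
    return feats


def herb_features(article, lexicon=HERB_LEXICON):
    """Extra features: which lexicon herbs appear in the title or abstract."""
    text = (article["title"] + " " + article["abstract"]).lower()
    found = sorted(name for name in lexicon if name in text)
    feats = {"herb=" + name: 1.0 for name in found}
    feats["herb_count"] = float(len(found))
    return feats


def combined_features(article):
    """Metadata and MeSH features extended with herb features."""
    feats = metadata_mesh_features(article)
    feats.update(herb_features(article))
    return feats


# Invented article record for illustration.
ARTICLE = {
    "journal": "Hypothetical Journal of Pharmacology",
    "title": "Effects of Panax ginseng extract on fatigue",
    "abstract": "We tested Panax ginseng root combined with Glycyrrhiza uralensis.",
    "mesh": ["Plant Extracts"],
}

print(sorted(metadata_mesh_features(ARTICLE)))
print(sorted(combined_features(ARTICLE)))
```

With metadata and MeSH alone, this record is described only by a generic journal and the heading “Plant Extracts”. The herb features add the signal that ties it to traditional medicine.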
In future work, we will experiment not only with the SVM but also with a variety of other machine learning algorithms. In addition, although the web server currently provides only basic statistical data on traditional medicine articles, it will be updated to provide more specialized trend analysis [18] or meta-analysis [19] data about them.