Background
In this “big data” era, the richness of information and the speed at which it spreads are nearly unimaginable. While this abundance may improve convenience, it also leaves both information and knowledge in a certain disorder, which can impede the effective use of research publications and related knowledge. Addressing this problem requires new strategies for information retrieval and knowledge organization that facilitate knowledge discovery and analysis [1].
Scientific literature is an important part of the overall universe of information resources. Quantitative analysis of the literature provides a basis for knowledge management, which in turn requires highly accurate information retrieval. Among the numerous challenges to achieving quality retrieval, two are particularly salient in the context of stem cell research. First, the scientific disciplines that may be interested in stem cell research express the same concept in different ways, so it is very difficult to list every synonym (or quasi-synonym) for each concept. Second, some technologies cover content in disparate knowledge fields and therefore cannot be searched from a single perspective. For example, tissue engineering covers a broad range of scientific research, so it is difficult to anticipate and provide for all the related search terms without creating a dedicated ontology or using other techniques. Recognizing that this research topic is important not only for stem cell research but also for other domains, we have designed an approach that partially but effectively addresses the problem of literature retrieval and classification [2, 3].
At present, three databases, PubMed, Web of Knowledge, and Scopus, are widely used in biomedical informatics [4, 5]. Each of these platforms can be called a Knowledge Organization System (KOS), another term for a classification system or, depending on its semantic structuring, a thesaurus, which allows for statistical analysis of the retrieval results [4]. Among the three, only PubMed indexes literature with a standardized vocabulary, the Medical Subject Headings (MeSH). MeSH is a highly representative and widely used controlled vocabulary designed for indexing journal articles, books, and related artifacts in the health and life sciences [6, 7]. The MeSH vocabulary has clear semantics, strict norms, and logical relationships among terms. When appropriately applied, MeSH can provide clear subject headings for the biological information communicated in an article [8, 9]. It is widely considered the most effective controlled terminology for biomedical literature retrieval and certain types of information mining [5, 10–13]. Controlled, organized vocabularies or thesauri such as MeSH let researchers spend less time and effort on synonym merging and polysemy splitting when ascertaining relationships between terms and concepts.
As a KOS, a thesaurus can sort and reorganize knowledge according to the content and characteristics of the subject. When used in literature retrieval, a thesaurus can recognize hyponymy within a retrieved field and the sophisticated classification of the retrieved data set [14, 15]. In doing so, thesauri can help solve the retrieval problem by improving recall and precision. However, thesaurus terms often express specific things, whereas data analysis is broader and deals with groups of items that share a property, otherwise known as categories. That is to say, when a category is retrieved using only one or two terms from the thesaurus, those terms may not be enough to find appropriate results [16, 17]. Additional terms need to be included and merged into different categories. Furthermore, a thesaurus such as MeSH includes many terms covering all areas of health science in order to manage knowledge and support effective retrieval. The MeSH terms and their current hierarchical structure might therefore lack the “specificity” needed to identify a precise set of terms that fully represents the concepts and knowledge structure of a special interdisciplinary research area [18]. Such specificity is needed in stem cell research in order to retrieve literature with satisfactory rates of precision and recall.
Although MeSH covers a broad range of topics in the health sciences, its hierarchical structures do not have the specificity required in specific technical domains such as stem cell research. This is one of the reasons why it is not ideal for patient records. Many concepts shown in the MeSH “hierarchical tree structure” do have rather specific sub-concepts, or so-called “children” (sometimes in more than one tree), but these are not always the semantic relations or syntactic structures needed for a particular context such as stem cell research. For example, the term “K562 Cells” resides at a lower level under “Myeloid Progenitor Cells,” which is an important keyword in the field of stem cells. However, “K562 Cells” refers to cell lines widely used in routine lab experiments and has little to do with stem cell research. In a literature search using “Myeloid Progenitor Cells,” the system traverses the “child” terms, including “K562 Cells,” and thus the search can bring in noisy literature.
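The traversal problem described above can be illustrated with a minimal sketch. The toy tree below is an assumption made for illustration, not an export of the real MeSH hierarchy, and the function names are hypothetical. The sketch expands a heading to its descendants, drops unwanted children, and emits each remaining heading with PubMed's `[mh:noexp]` tag, which turns off automatic expansion to narrower terms.

```python
# Toy fragment of a MeSH-like tree. The parent/child links are illustrative
# assumptions; consult the real MeSH tree before using them in a search.
TOY_TREE = {
    "Myeloid Progenitor Cells": ["Granulocyte Precursor Cells", "K562 Cells"],
    "Granulocyte Precursor Cells": [],
    "K562 Cells": [],
}


def explode(term, tree):
    """Return the term followed by all of its descendants (depth-first)."""
    found = [term]
    for child in tree.get(term, []):
        found.extend(explode(child, tree))
    return found


def build_query(term, tree, exclude=()):
    """OR together the term and its descendants, minus excluded headings.

    Each heading carries [mh:noexp] so that the search engine does not
    re-expand it and bring the excluded children back in.
    """
    kept = [t for t in explode(term, tree) if t not in exclude]
    return " OR ".join(f'"{t}"[mh:noexp]' for t in kept)


if __name__ == "__main__":
    print(build_query("Myeloid Progenitor Cells", TOY_TREE, exclude={"K562 Cells"}))
```

Listing the kept headings explicitly is only one possible design; an alternative is to search the exploded parent and subtract the unwanted child with a `NOT` clause, at the risk of discarding articles indexed under both headings.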
In fact, multiple combined search keywords are needed to fully cover specific topics such as stem cell research. Both automated and human-recommended MeSH terms have been used for literature search and annotation [19, 20]. Professional searchers craft combinational search queries when using MeSH to search specific subjects. The relevant terms may be scattered or clustered at different locations in the MeSH “tree structure.” For example, in the field of stem cell research the hyponyms of “stem cells” include “embryonic stem cells,” “induced pluripotent stem cells,” “adult stem cells” and “cancer stem cells.” However, stem cell research covers an even wider scope, and terms such as “hematopoietic stem cell transplantation,” “stem cell microenvironment” and “transdifferentiation” are not hyponyms of “stem cells.” Therefore, searching with “stem cells” alone might omit relevant literature. MeSH indexing operates at the concept level, and for stem cell research activities such as knowledge discovery and organization, one or two MeSH terms are usually not enough to represent the complexity of the field. We propose to address this by reconstructing MeSH into a classified thesaurus based on a specific perspective and context.
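The coverage gap described above can be sketched with a toy index. The document identifiers and their term assignments are invented for illustration, the term strings follow the wording in the text rather than verified MeSH headings, and hierarchy expansion is deliberately ignored so that only the effect of combining terms is visible.

```python
# Hypothetical records: document id -> set of indexed terms (invented data).
TOY_INDEX = {
    "doc1": {"Stem Cells"},
    "doc2": {"Hematopoietic Stem Cell Transplantation"},
    "doc3": {"Transdifferentiation"},
    "doc4": {"K562 Cells"},
}


def retrieve(index, query_terms):
    """Return ids of documents indexed with at least one query term."""
    wanted = set(query_terms)
    return sorted(doc for doc, terms in index.items() if terms & wanted)


# A single-term search versus a combinational search over several terms.
single = retrieve(TOY_INDEX, ["Stem Cells"])
combined = retrieve(
    TOY_INDEX,
    [
        "Stem Cells",
        "Hematopoietic Stem Cell Transplantation",
        "Stem Cell Microenvironment",
        "Transdifferentiation",
    ],
)

if __name__ == "__main__":
    print("single term:", single)
    print("combined   :", combined)
```

In this toy setting the single-term search returns only the document indexed directly with “Stem Cells,” while the combined query also reaches the transplantation and transdifferentiation documents and still leaves the cell-line document out.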
The stem cell area is at the cutting edge of scientific research. New breakthroughs are reported frequently in the literature, and several have appeared in Science magazine’s top ten lists of scientific breakthroughs since 1999. Stem cells, or related derivatives, can be transplanted into a patient to replace damaged cells and regenerate new cells or tissues [21]. This brings new hope for the treatment of “incurable diseases” [22]. Using stem cells to develop disease models for drug screening and for pinpointing disease mechanisms can help researchers understand the pathogenesis of complex diseases [23]. Stem cell research is a rapidly changing field, and it is an interdisciplinary and multifaceted area of study. Selecting all of the relevant stem cell literature from interconnected, discipline-based literature databases requires scientists to create a better set of keywords that supports search completeness and accuracy. Exploring stem cell related MeSH classifications could also benefit knowledge construction and classification for future stem cell research.
This reconstruction is a process of knowledge reorganization. If embedded into a retrieval system, the classified thesaurus would support literature category navigation and automatic classification. This would help achieve automated literature analysis that serves the information needs of researchers in this domain, and the approach could be replicated for other research areas.
Discussion
In this study, the procedure for constructing a classified thesaurus was explored and validated through precision and recall tests of combinatory literature search queries built from selected MeSH-based terms, through an understanding of these terms’ hierarchical relationships, and by grouping MeSH terms into different stem cell based knowledge organization systems. The principles of construction were: 1) classify the MeSH terms by high or low frequency; 2) reject terms whose concepts are too general, as well as terms for certain substances, such as protein, DNA sequence and molecule; 3) apply expert judgement to term selection; and 4) verify the selection through retrieval experiments.
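The precision and recall ratios used in such retrieval experiments follow the standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. A minimal sketch is given here; the function name and the example identifiers are hypothetical and are not data from the study.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query.

    retrieved: ids returned by the search
    relevant:  ids judged relevant (e.g., by domain experts)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Four documents retrieved, three judged relevant, two in common.
    p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
    print(f"precision={p:.2f} recall={r:.2f}")
```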
The validity of the TSRSC was tested with a Vector Space Model. The results of this portion of the study supported the hypothesis that there is a significant correlation between the TSRSC terms and retrieval results. Careful selection of a set of appropriate TSRSC terms provided thorough and precise search results, and the retrieval results showed both high recall and high precision.
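The details of the Vector Space Model test are not given in this section, so the following is only a generic sketch of the idea rather than the procedure used in the study. It assumes that a term set and a document are each represented as a bag of terms with raw counts as weights, and compares them by cosine similarity; the example terms are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vectors: raw term counts stand in for whatever weighting
# scheme (e.g., TF-IDF) a real experiment would use.
query_vec = Counter(["stem cells", "transdifferentiation"])
doc_vec = Counter(["stem cells", "stem cells", "cell line"])

if __name__ == "__main__":
    print(round(cosine(query_vec, doc_vec), 3))
```

A higher score indicates that a document shares more heavily weighted terms with the selected term set, which is the kind of relationship between terms and retrieval results that the validation looks for.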
In the future, classified thesauri for other fields can be constructed following this method. The method can be enhanced by automation: artificial intelligence programs can pre-filter redundant or unrelated terms, such as protein and DNA sequence, to reduce the judgment work required of experts. Data mining programs can also be used to count MeSH term frequencies, including those of root nodes, so that the position of each term can be ascertained quickly. Furthermore, the MeSH terms can be classified into subcategories to provide a further breakdown of knowledge organization.
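The frequency-counting step mentioned above could be automated along the following lines. This is a minimal sketch under stated assumptions: each record is taken to be a list of MeSH terms assigned to one article, the sample records are invented, and the threshold separating high-frequency from low-frequency terms is a hypothetical parameter that would have to be chosen per study.

```python
from collections import Counter


def term_frequencies(records):
    """Count in how many records each term occurs (document frequency)."""
    counts = Counter()
    for terms in records:
        counts.update(set(terms))  # a term counts once per record
    return counts


def split_by_frequency(counts, threshold):
    """Split terms into (high, low) sets around an assumed threshold."""
    high = {term for term, n in counts.items() if n >= threshold}
    low = set(counts) - high
    return high, low


if __name__ == "__main__":
    # Invented sample records, one list of indexed terms per article.
    records = [
        ["Stem Cells", "Cell Differentiation"],
        ["Stem Cells", "Proteins"],
        ["Stem Cells", "Cell Differentiation", "Stem Cells"],
    ]
    counts = term_frequencies(records)
    print(counts.most_common())
    print(split_by_frequency(counts, threshold=2))
```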
MeSH is an authoritative controlled vocabulary that is highly effective for bibliographic retrieval in systems that use it. However, studies and literature reviews on inter-indexer consistency often find variability and inconsistency, with consistency rates in the 20–40% range [25–27]. Our method takes this into consideration and partly addresses the problem by enhancing term specificity, looking at how the terms are positioned in the subcategories within the thesaurus. This helps improve retrieval results for MeSH terms that are indexed incompletely. If these selected MeSH terms and the process of stem cell-based knowledge reorganization can be embedded into a retrieval system, the classified thesaurus could support literature category navigation and automatic classification, further achieving automated literature analysis.
Using stem cell research literature as a case study, it was possible to investigate topic-based knowledge organization schemes and to refine and extend the use of MeSH terms and their hierarchical structures for organizing stem cell research. This method also facilitates classification-based literature navigation and literature auto-classification, and it benefits knowledge discovery and mining as well as scientific data analysis.
Classifying stem cell research literature involves more than a simple cross-walk between classification schemes (e.g., Dewey Decimal System) or thesauri (e.g., MeSH), because meaningful information may be missed when terms in the two schemes do not match. Matching upper-class terms may yield only an approximation that is not precise enough. The fine-grained classification method proposed in this study therefore provides a new way to construct classification schemes with enhanced clarity. The study opens possibilities for merging different classification schemes or metadata connected with different retrieval systems. It also suggests possible unified retrieval solutions for obtaining related literature across different databases or information repositories. Unlike the cross-walk approach between two schemas, which may miss important information, our approach can evaluate the efficiency of search term selection in terms of precision and recall, and can assess the positions and relationships of the terms within the MeSH hierarchy. Building combinational search queries from selected MeSH terms could improve literature retrieval and knowledge organization regardless of the complexity of the scientific subject, even if it is interdisciplinary and spans several fields.
The research findings suggest that the given TSRSC query is useful for improving literature retrieval in stem cell research. Because the query is specific to this domain, however, the methodology must be replicated to build new queries in other scientific domains. To foster query creation in other domains, information professionals should advise researchers that the expert steps require a significant investment of time and resources to achieve quality results.
Conclusions
This study demonstrated a new approach that can mechanically identify and refine a set of MeSH terms to facilitate literature retrieval and knowledge organization in a particular scientific domain. The study also identified non-MeSH terms that were valuable for improving precision and recall in literature retrieval and that therefore enrich the description of a dynamic and interdisciplinary research area such as stem cell research. Based on the reconstructed thesaurus created from these selected terms, the study indicated that identifying a proper set of MeSH terms, and their alternatives, improves the precision and recall of literature retrieval. Through a case study of stem cell research, we explored a way to construct a classified thesaurus, thus offering one possible solution to the problem of literature retrieval. This classified thesaurus results from a knowledge reconstruction that categorizes existing thesauri and ultimately improves the retrieval of clustered literature. With its efficacy supported by experiment, this exploration lays the foundation for future work on automatic literature classification and navigation, which may serve science and technology management. This approach can guide scientists, medical researchers, and librarians in using MeSH, or other tools, to construct a set of domain-specific search terms for literature research; it can also allow information professionals to develop a MeSH-based classified thesaurus for knowledge management with regard to a particular research topic. By selecting the right MeSH terms, the approach outlined in this study can optimize retrieval performance and knowledge organization for scientists dealing with a dynamic and complex research topic.