Background
MicroRNAs (miRNAs) are endogenous short non-coding RNAs (ncRNAs) of about 22 nucleotides that can regulate the expression of target messenger RNAs (mRNAs) at the post-transcriptional and translational levels [1–4]. The 21st version of miRBase [5] contains 28,645 miRNAs, including more than three thousand human miRNAs. As regulators of gene expression and protein production, some miRNAs act as negative regulators by binding to the 3′-UTRs of their target mRNAs [4], whereas others have a positive regulatory effect [6, 7]. MiRNAs thus affect cell proliferation [8], development [9], differentiation [10], apoptosis [11], metabolism [12, 13], aging [12, 13], signal transduction [14], and viral infection [10]. Moreover, evidence is mounting that miRNAs play a fundamental role in the development, progression, and prognosis of numerous human diseases [15–20]. For instance, miR-132 can enhance HIV-1 replication [21], and cocaine can likewise enhance HIV-1 replication by down-regulating miR-125b in CD4+ T cells [22]. Down-regulation of miR-140 in basal-like early-stage breast cancer can promote the formation of breast neoplasm stem cells [23]. In addition, miR-139 and miR-140 were down-regulated relative to normal epithelium during lobular neoplasia progression [24]. Transcripts of certain let-7 homologs are down-regulated in human lung cancer, and low let-7 levels are linked to poor prognosis [25]. Many other miRNAs are also related to non-small-cell lung cancer [26–29].
Given the great variety of miRNAs and diseases, experimental methods for finding new miRNA-disease associations are both costly and time-consuming. As biological datasets grow, practical computational methods are urgently needed to help identify more disease-related miRNAs and to explore new perspectives on the treatment of various important human diseases. Over the past decade, some progress has been made in uncovering novel miRNA-disease associations. Most computational methods depend on the assumption that functionally similar miRNAs tend to be associated with phenotypically similar diseases [30–36]. From the standpoint of network and systems biology, most computational methods fall into either similarity measure-based approaches or machine learning-based approaches.
Jiang et al. [37] were the first to construct a functionally related miRNA network and a human phenome-microRNAome network. They then combined the disease phenotype similarity network, the miRNA functional similarity network, and the known human disease-miRNA association network. Based on this combination, they devised a computational model of disease-miRNA prioritization that could rank the entire human microRNAome for the diseases investigated. However, its prediction performance was modest because it used only miRNA neighbor information. Xuan et al. [38] proposed the HDMP model, which predicts disease-related miRNA candidates on the basis of the weighted k most similar neighbors. In HDMP, miRNA functional similarity was calculated from the information content of disease terms and disease phenotype similarity. MiRNA family (cluster) information was then taken into account, and miRNA functional similarity was recalculated after giving higher weight to members of the same miRNA family (cluster). However, its precision was directly influenced by the number of a miRNA’s neighbors. Both methods were limited by their local network similarity measures: considering miRNA neighbor information alone was insufficient. Global network similarity measures were therefore adopted in some studies. Chen et al. [39] proposed Random Walk with Restart for MiRNA-disease association (RWRMDA), in which random walk analysis was applied to the miRNA–miRNA functional similarity network. Despite its acceptable predictive accuracy, this method could not be applied to diseases with no confirmed related miRNAs. Xuan et al. [40] further put forward a random walk method, MIDP, in which the transition weights of labeled nodes were higher than those of unlabeled nodes. In MIDP, the side effect of noisy data was reduced by adjusting the restart rate, and the method is applicable to diseases with no related miRNAs.
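The random walk with restart idea behind methods such as RWRMDA can be sketched in a few lines. The following Python sketch is a minimal illustration under stated assumptions, not the published implementation: the function name `random_walk_with_restart`, the restart probability of 0.7, the convergence tolerance, and the toy similarity matrix are all chosen for the example. It also shows why such a method cannot rank candidates for a disease without known related miRNAs: with no seed nodes, the initial probability vector is undefined.

```python
def random_walk_with_restart(weights, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Score every node by its steady-state proximity to the seed nodes.

    weights: symmetric, non-negative similarity matrix (list of lists).
    seeds:   indices of miRNAs already known to be related to the disease.
    """
    if not seeds:
        # No known related miRNAs: the walk has nowhere to start from.
        raise ValueError("random walk with restart needs at least one seed node")
    n = len(weights)
    # Column-normalise the weights so each non-empty column sums to 1.
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    trans = [[weights[i][j] / col_sums[j] if col_sums[j] else 0.0
              for j in range(n)] for i in range(n)]
    # Initial distribution: uniform over the seed nodes.
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        # One step: walk along the network or jump back to the seeds.
        nxt = [(1 - restart) * sum(trans[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy miRNA functional similarity network with four miRNAs (illustrative values).
similarity = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.5],
    [0.1, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.5, 0.0],
]
scores = random_walk_with_restart(similarity, seeds={0})
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this toy network, miRNA 1 is strongly similar to the seed miRNA 0 and therefore receives a higher score than miRNA 2, which is only weakly similar to it.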
Some other methods made use of confirmed disease-related genes and predicted miRNA-target interactions. For instance, Shi et al. [41] developed a computational prediction method in which random walk analysis was applied to protein–protein interaction (PPI) networks. The underlying assumption is that if a target gene of a miRNA is associated with a disease, the disease is likely to be related to that miRNA. MiRNA-target interactions and disease-gene associations were integrated into a PPI network, and the functional relationships between miRNA targets and disease genes were then mined from this network. This method could also be used to find miRNA-disease co-regulated modules by hierarchical clustering analysis. Mørk et al. [42] presented miRPD, which predicts miRNA-protein-disease associations rather than only miRNA-disease associations. It uses the abundant information on proteins as a bridge that indirectly links miRNAs and diseases. Specifically, known and predicted miRNA-protein associations were coupled with protein-disease associations from the literature to infer miRNA-disease associations. However, the molecular bases that are at least partly known cover less than 40% of human diseases, and highly accurate miRNA-target interactions can hardly be obtained. In other words, these two methods lacked a solid data foundation. Chen et al. [43] proposed a model based on super-disease and miRNA for potential miRNA-disease association prediction (SDMMDA). Because few miRNA-disease associations are known and many associations are ‘missing’, the concepts of ‘super-miRNA’ and ‘super-disease’ were introduced to improve the similarity measures for miRNAs and diseases.
Computational methods based on machine learning offer further inspiration. Xu et al. [44] constructed the miRNA-target dysregulated network (MTDN) and introduced a support vector machine (SVM) classifier based on the features and changes in miRNA expression to distinguish positive miRNA-disease associations from negative ones. However, little confirmed information about negative samples was available, so improvement was needed. To address the lack of negative samples, Chen et al. [45] developed a semi-supervised method named Regularized Least Squares for MiRNA-disease association (RLSMDA). Within the framework of regularized least squares, RLSMDA is a global method that integrates disease semantic similarity, miRNA functional similarity, and human miRNA-disease associations. RLSMDA can prioritize all possible miRNA-disease associations simultaneously without requiring negative samples. Chen et al. [46] proposed Restricted Boltzmann machine for multiple types of miRNA-disease association prediction (RBMMMDA), which can identify four types of miRNA-disease associations. RBMMMDA was the first model able to identify different types of miRNA-disease associations. Another hypothesis holds that distributional semantics can reveal information attached to miRNAs and diseases. Pasquier and Gardès [47] developed a model named MirAI that investigated this hypothesis by expressing distributional information on miRNAs and diseases in a high-dimensional vector space; associations between miRNAs and diseases could then be defined according to their vector similarity. Chen et al. [39] introduced the KNN algorithm into miRNA-disease association prediction and proposed the computational model RKNNMDA (Ranking-based KNN for MiRNA-disease association prediction).
Some previous studies focused on prediction models built with network tools. For instance, Xuan et al. [40] divided network nodes into labeled and unlabeled nodes and gave them different transition weights. Because the restart of the walk determines the walking distance, the negative effect of noisy data was lessened. In particular, information from the different layers of the miRNA-disease bilayer network was weighted differently. Chen et al. [48] then developed Within and Between Score for MiRNA-disease association prediction (WBSMDA), in which Gaussian interaction profile kernel similarity for diseases and miRNAs was combined for the first time with miRNA functional similarity, disease semantic similarity, and miRNA-disease associations. Chen et al. [49] further proposed Heterogeneous graph inference for miRNA-disease association prediction (HGIMDA), whose heterogeneous graph was constructed by combining miRNA functional similarity, disease semantic similarity, Gaussian interaction profile kernel similarity, and miRNA-disease associations. Similar to random walk, HGIMDA is an iterative process that seeks optimal solutions based on global network similarity. HGIMDA achieved AUCs of 0.8781 and 0.8077 in global and local leave-one-out cross validation (LOOCV), respectively. Li et al. [50] put forward MCMDA (Matrix Completion for MiRNA-disease association prediction), in which a matrix completion algorithm was introduced and the low-rank miRNA-disease matrix was updated efficiently. WBSMDA, HGIMDA, and MCMDA are applicable to diseases (miRNAs) without any confirmed related miRNAs (diseases). MaxFlow is a combinatorial prioritization algorithm proposed by Yu et al. [51]. In addition to the types of data used in WBSMDA, MaxFlow also introduced information about disease phenotypic similarity, miRNA family, and miRNA cluster. A directed miRNAome-phenome network graph was then constructed, and every weighted edge was treated as a flow capacity. The association possibility was defined as the flow quantity from the miRNA node to the investigated disease node. You et al. [52] proposed the Path-Based computational model for MiRNA-disease association prediction (PBMDA). A heterogeneous graph comprising three interlinked sub-graphs was constructed from the same data as in WBSMDA, and a depth-first search algorithm was applied to predict possible miRNA-disease associations. Chen et al. [53] summarized the relatively important miRNA-disease association prediction approaches.
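Gaussian interaction profile kernel similarity, used by WBSMDA and HGIMDA, is computed from the known association matrix alone. The sketch below follows the widely used definition in which the similarity of two diseases is exp(-γ‖IP(d_i) - IP(d_j)‖²), with the bandwidth γ normalized by the mean squared norm of the interaction profiles. This section does not restate the formula, so the definition, the function name `gip_kernel_similarity`, the default `gamma_prime=1.0`, and the toy matrix are assumptions made for illustration.

```python
import math


def gip_kernel_similarity(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel similarity between rows.

    profiles: binary association matrix; row i is the interaction profile
              of entity i (e.g. one disease against all miRNAs).
    """
    n = len(profiles)
    if n == 0:
        raise ValueError("at least one interaction profile is required")
    # Bandwidth is normalised by the mean squared norm of the profiles.
    mean_sq_norm = sum(sum(x * x for x in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("at least one known association is required")
    gamma = gamma_prime / mean_sq_norm

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [[math.exp(-gamma * sq_dist(profiles[i], profiles[j]))
             for j in range(n)] for i in range(n)]


# Toy association matrix: 3 diseases (rows) x 3 miRNAs (columns).
associations = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
disease_similarity = gip_kernel_similarity(associations)
```

Diseases 0 and 1 share identical profiles, so their similarity is 1, while disease 2 shares no miRNA with them and receives a much lower value. Passing the transposed matrix, for example `list(zip(*associations))`, would give the corresponding miRNA similarity.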
More links should exist between miRNAs and diseases than are currently known. However, the computational methods described above were limited by the use of inaccurate information (such as miRNA-target interactions), the selection of parameter values, the combination of different classifiers in different networks or spaces, and so on. In pursuit of higher predictive accuracy, we proposed Heterogeneous Label Propagation for MiRNA-disease association prediction (HLPMDA) to predict underlying miRNA-disease associations. In HLPMDA, heterogeneous data (miRNA similarity, disease similarity, miRNA-disease associations, long non-coding RNA (lncRNA)-disease associations, and miRNA–lncRNA interactions) were integrated into a heterogeneous network [54]. The disease-related miRNA prioritization problem was then formulated as an optimization problem that takes into account within-network smoothness and cross-network consistency. HLPMDA achieved AUCs of 0.9232, 0.8437, and 0.9218 ± 0.0004 in global LOOCV, local LOOCV, and 5-fold cross validation, respectively. In both local and global LOOCV, HLPMDA outperformed previous methods. In case studies of three human diseases, 47, 49, and 46 of the top 50 predicted miRNAs for esophageal neoplasms, breast neoplasms, and lymphoma, respectively, were verified by recent experimental research.
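The AUC values quoted above summarize how well a method ranks held-out known associations above unlabeled candidates. As a rough illustration, the AUC of a list of scores can be computed as the fraction of (positive, negative) pairs that are ordered correctly, with ties counted as half. The helper `auc_from_scores` and the toy scores are hypothetical, and the sketch does not reproduce the global or local LOOCV protocol itself.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted association scores.
    labels: 1 for a held-out known association, 0 for an unlabeled candidate.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both positive and negative samples are required")
    # Count correctly ordered pairs; a tie contributes one half.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Toy example: two held-out associations among four candidates.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
toy_auc = auc_from_scores(toy_scores, toy_labels)  # 3 of 4 pairs ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the reported values should be read.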
Discussion
The reliability and availability of HLPMDA lie in several aspects. First, HMDD and other biological datasets provided a solid foundation for the subsequent prediction steps. Second, the introduction of lncRNA data and the application of bipartite network projection help profile the relationships between pairs of miRNAs and between pairs of diseases. It is widely accepted that more data may help produce better output. Adding the corresponding lncRNA data brings more information to the problem of latent miRNA-disease association prediction. This is a fresh perspective, and the performance of HLPMDA showed it to be an advantageous improvement. Bipartite network projection also extracted more implicit information, which made the prediction more accurate. In addition, heterogeneous label propagation is a useful algorithm that draws on both local and global features of the constructed network and does not need negative examples. In recent years, network approaches have been widely adopted in some fields of bioinformatics [79–81]. The main reason is that similarities, links, associations, interactions, and relationships among the research targets (such as miRNAs and diseases) become easier to represent, calculate, analyze, and test with mathematical tools once descriptive expressions are transformed into quantitative representations. As a result, the network approach helps improve the effectiveness of the prediction. Finally, according to NanoString’s Hallmarks of Cancer Panel collection (https://www.nanostring.com/), some of the miRNAs’ targets have been shown to be related to cancer hallmarks [82, 83], which were found to be associated with the corresponding genes. Our work may therefore be helpful for further research on cancer hallmarks, genes, and miRNAs.
However, HLPMDA is limited by the following factors, which also indicate room for improvement. First, the data on miRNAs and diseases are not yet sufficient. For instance, the known miRNA-disease associations are very sparse (labeled miRNA-disease associations account for only 2.86% of the 189,585 miRNA-disease pairs). More data are expected to improve the performance of the computational model. Therefore, as more information about miRNAs, diseases, and other related objects (such as genes, drugs, and targets) is put to use [84], the predictive power of HLPMDA should become stronger. Second, the model may treat different miRNAs or diseases unequally because the amount of known information varies from item to item. HLPMDA may therefore be biased in favor of miRNAs or diseases that have more known association (or interaction) records. Last but not least, the parameters in HLPMDA were set according to previous similar studies and our experience. We did not tune the parameters extensively, and better values may exist that would yield more accurate predictions.
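To make the sparsity figure concrete, the following back-of-the-envelope calculation converts the reported percentage into an approximate count. Because 2.86% is a rounded value, the resulting number of labeled pairs is only an estimate, and the variable names are illustrative.

```python
total_pairs = 189_585      # miRNA-disease pairs reported in the text
labeled_fraction = 0.0286  # rounded share of labeled associations

# Approximate counts implied by the rounded percentage.
approx_labeled = round(total_pairs * labeled_fraction)
approx_unlabeled = total_pairs - approx_labeled
unlabeled_per_labeled = approx_unlabeled / approx_labeled

print(approx_labeled, approx_unlabeled, round(unlabeled_per_labeled, 1))
```

Under these rounded inputs, roughly 5,400 pairs are labeled, which leaves about 34 unlabeled pairs for every labeled one.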
Data collection, database construction, data analysis, mining, and testing of miRNA-disease associations have become an important field in bioinformatics. Many fields of biology are strongly connected. Research on miRNA-disease associations relates to protein–protein interactions, miRNA-target interactions, miRNA–lncRNA interactions, drugs, environmental factors, and so on. In the future, we believe this field needs to obtain more data and to be integrated with other research areas so that more integrated data can produce predictive synergy.