Background
Poor delivery and low bioavailability of therapeutic molecules are two main obstacles in the drug development process. The plasma membrane is selectively permeable and remains a major barrier for most therapeutic molecules. To overcome this barrier, a number of delivery systems have been developed over the years [1, 2]. Despite tremendous progress, the existing delivery methods can result in high toxicity, immunogenicity and low delivery yield. In the last decade, short peptides known as cell penetrating peptides (CPPs) or protein transduction domains (PTDs) have gained much recognition as efficient delivery vehicles [3]. CPPs have a remarkable ability to traverse eukaryotic membranes without significant membrane damage. In addition, CPPs can carry a variety of cargoes, such as peptides [4, 5], proteins [6], drugs [7, 8], nucleic acids [9, 10], siRNAs [11, 12] and nanoparticles [13, 14], across the cell membrane. Almost any cargo can be transported into the cell once it is conjugated to a CPP [15]. Thus, CPPs have great therapeutic potential, especially in drug delivery. Although the first CPP was discovered 25 years ago, the mechanism of CPP uptake is still not clear. Two routes of internalization have been proposed: direct penetration and the endocytic pathway [16].
Since the discovery of the first CPP, the Tat (transcription activator of human immunodeficiency virus type 1) peptide, hundreds of CPPs of varied length and physicochemical properties have been discovered [17]. Most of these peptides are short (up to 35 amino acids), water soluble, partly hydrophobic and/or polybasic, with a net positive charge at physiological pH [18]. A few attempts have been made in the past to develop computational methods for CPP prediction [19-22]. In 2008, Hansen et al. developed a method based on a set of z-scales for 87 coded and non-coded amino acids published by Sandberg and his group [23]. The z-scales are derived from many variables, such as molecular weight, molecular orbital calculations and proton NMR shifts, and the resulting z-scores are used to predict CPPs. This method gave 68% prediction efficiency, which is too low to distinguish CPPs reliably from non-CPPs. In 2010, Dobchev et al. used quantitative structure-activity relationship (QSAR) and artificial neural network (ANN) models to predict CPPs and achieved a maximum accuracy of 83%; however, sequences that were difficult to predict were excluded. More recently, Sanders et al. (2011) used support vector machine (SVM) models built on various biochemical properties to predict CPPs on five different datasets. A major limitation of the previous methods is that the training datasets were very small (at most 111 CPPs) and that none of the methods is available as a web service for public use. In addition, most of the previous methods used unbalanced datasets, which pose problems for machine learning classifiers. Sanders et al. highlighted this point in their study, in which they used both balanced and unbalanced datasets: they achieved 95% accuracy on the balanced dataset but only 75% on the unbalanced one. This poorer performance of SVM reflects the learning bias inherent in unbalanced datasets and demonstrates the need for balanced datasets in machine learning.
In the present study, we have made a systematic attempt to complement existing methods for predicting CPPs with high accuracy. We used a large dataset (708 CPPs) for training, testing and evaluating our models. The dataset is derived from CPPsite, the first database of experimentally validated CPPs [24]. We used various features, namely amino acid composition, dipeptide composition, binary profile of patterns and physicochemical properties, as input for developing SVM models. In addition, we identified various CPP-specific motifs, which were used to develop a hybrid model. For the first time, a prediction web tool has been developed to assist the scientific community working in the area of CPPs.
Methods
Main datasets
We extracted 843 experimentally validated CPPs from the CPPsite database, which has been developed by our group [24]. Peptides containing non-natural amino acids (e.g. selenocysteine) or D-amino acids (D-conformation) were removed, leaving 708 unique CPPs composed of natural amino acids. Three different datasets (CPPsite-1, CPPsite-2 and CPPsite-3) were created from these peptides. Since very few peptides have been experimentally validated as non-CPPs (negative examples), an equal number of peptides (15–30 amino acids long) was generated randomly from SwissProt proteins and treated as non-CPPs. This strategy for creating a negative dataset has already been used in previous studies [22, 25].
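The negative-set construction described above can be sketched as follows. This is only an illustration: the text does not say how fragments were drawn, so uniform sampling of the source protein, fragment length and start position is an assumption, and the function name `sample_random_peptides` is hypothetical.

```python
import random


def sample_random_peptides(proteins, n_peptides, min_len=15, max_len=30, seed=0):
    """Draw random fragments of min_len..max_len residues from protein sequences.

    Assumption: protein, length and start position are all chosen uniformly.
    """
    rng = random.Random(seed)
    # Keep only proteins long enough to yield the longest possible fragment.
    eligible = [p for p in proteins if len(p) >= max_len]
    if not eligible:
        raise ValueError("no protein is long enough to sample from")
    peptides = []
    for _ in range(n_peptides):
        protein = rng.choice(eligible)
        length = rng.randint(min_len, max_len)  # inclusive on both ends
        start = rng.randint(0, len(protein) - length)
        peptides.append(protein[start:start + length])
    return peptides
```

Calling `sample_random_peptides(swissprot_sequences, 708)` with a list of SwissProt sequences (a hypothetical variable) would give a negative set of the same size as the CPPsite-1 positive set.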
The first dataset (CPPsite-1) contains 708 CPPs (positive examples) and 708 non-CPPs (negative examples). Because CPPsite-1 includes CPPs with a wide range of uptake efficiency (low and high), we derived a second dataset, CPPsite-2, from it. CPPsite-2 contains 187 CPPs with high uptake efficiency and an equal number of non-CPPs. The third dataset (CPPsite-3) contains the 187 CPPs with high uptake efficiency as positive examples and an equal number of CPPs with low uptake efficiency as negative examples. A model based on the CPPsite-3 dataset can therefore discriminate between CPPs of high and low efficiency.
All datasets (CPPsite-1, CPPsite-2 and CPPsite-3) include several CPPs together with all possible Ala-scan mutants or different truncations. Ideally, redundancy in the datasets should be removed because it affects the performance of a prediction method, and our group has removed redundancy in various earlier prediction methods [25, 26]. In this study, however, we did not remove redundancy from the CPP datasets, because a single residue can affect the uptake efficiency of a CPP and removing similar sequences could lose information about CPPs. To check the performance of our models on redundant data, we also used benchmark datasets that are redundant.
Benchmark datasets
To compare our method with existing methods, we extracted datasets that were used in previous studies from the literature. Sanders et al. (2011) developed a method for CPP prediction using 111 experimentally validated CPPs and an equal number of non-CPPs (generated randomly from the chicken proteome). We named this dataset Sanders-2011a. A second dataset from Sanders et al. (2011), named Sanders-2011b, contains 111 CPPs and 34 experimentally validated non-CPPs. We also generated a third dataset, Sanders-2011c, consisting of 111 CPPs and 111 non-CPPs randomly sampled from the 34 known non-CPPs. Dobchev et al. (2010) used 74 CPPs and 24 non-CPPs, collected from the literature, to develop their method for CPP prediction. We used this dataset in this study and named it Dobchev-2010. Similarly, we created the datasets Hansen-2008 (containing 66 CPPs and 19 non-CPPs) [20] and Hallbrink-2005 (containing 53 CPPs and 16 non-CPPs) [19] from previous studies.
Independent dataset
In order to evaluate the performance of our method, we have created an independent dataset of 99 novel CPPs, which have not been included in the training, feature selection and parameter optimization of the model. These peptides have been collected manually from recent research papers and patents.
Cross-validation technique
Validation is an essential part of any prediction method. In the present study, five-fold cross-validation was used to evaluate the performance of all models. The sequences are randomly divided into five sets, of which four are used for training and the remaining one for testing. The process is repeated five times so that each set is used once for testing, and the final performance is obtained by averaging the performance over all five sets. We also used jack-knife cross-validation, or Leave One Out Validation (LOOV), to evaluate the performance of our models. In this technique, one sample is used for testing and the remaining samples for training, and the process is repeated so that each sample is used exactly once for testing.
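As an illustration of the splitting scheme described above, a minimal sketch might look like this. The function name `k_fold_indices` and the fixed random seed are assumptions made for the example, not details from the study.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Each sample index appears in exactly one test fold.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Setting `k` equal to `n_samples` gives the leave-one-out (jack-knife) scheme, since every test fold then holds a single sample.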
Support vector machine
We used SVM, a highly successful machine learning classifier, to build the prediction models. Models were trained with the SVMlight package, version 6.02 [27], using various kernels (e.g. linear, polynomial, radial basis function and sigmoid tanh), in which the dot product between input vectors is replaced by a kernel function. Here, we used the radial basis function (RBF) kernel and optimized SVM performance over the parameter ranges g ∈ [10^-4, 10], c ∈ [1, 15] and j ∈ [1, 5]. SVM requires a fixed-length set of input features for training, which necessitates a strategy for encapsulating the global information about proteins/peptides of variable length in a fixed-length format. This fixed-length format was obtained from protein/peptide sequences of variable length using amino acid composition, dipeptide composition and binary profile of patterns. After training, the learned model can be used to predict unknown examples.
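The parameter search over g, c and j can be pictured as a grid enumeration. The sketch below is hypothetical: the text gives only the ranges, so the step sizes (powers of ten for g, integer steps for c and j) are assumptions, and the code only builds the grid without calling SVMlight.

```python
from itertools import product

# Assumed grid: the source states ranges only, not the step sizes.
G_VALUES = [float(f"1e{e}") for e in range(-4, 2)]  # 1e-4, 1e-3, ..., 10
C_VALUES = list(range(1, 16))                        # 1 .. 15
J_VALUES = list(range(1, 6))                         # 1 .. 5


def parameter_grid():
    """Return every (g, c, j) combination as a dictionary."""
    return [
        {"g": g, "c": c, "j": j}
        for g, c, j in product(G_VALUES, C_VALUES, J_VALUES)
    ]
```

Each combination would then be scored by cross-validation and the best-performing one kept.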
Amino acid composition
Peptide information can be encapsulated in a vector of 20 dimensions using the amino acid composition of the peptide, that is, the fraction of each amino acid type within the peptide. The composition of all 20 natural amino acids was calculated using the following equation:
$$\mathrm{Comp}(i) = \frac{R_i}{N} \times 100$$
where Comp(i) is the percent composition of amino acid i, R_i is the number of residues of type i, and N is the total number of residues in the peptide.
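A minimal sketch of this feature, assuming the 20 natural amino acids are ordered alphabetically by one-letter code (the ordering is not stated in the text):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 natural residues


def amino_acid_composition(peptide):
    """Return the 20-dimensional percent composition vector of a peptide."""
    peptide = peptide.upper()
    n = len(peptide)
    if n == 0:
        raise ValueError("peptide must not be empty")
    # Comp(i) = 100 * (residues of type i) / (total residues)
    return [100.0 * peptide.count(aa) / n for aa in AMINO_ACIDS]
```

For example, `amino_acid_composition("AAKK")` assigns 50.0 to both Ala and Lys and 0.0 to every other residue.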
Dipeptide composition
The dipeptide composition gives the composition of residue pairs (e.g. Ala-Ala, Ala-Leu, etc.) present in a peptide and is used to transform peptides of variable length into fixed-length feature vectors. It gives a fixed pattern length of 400 (20 × 20) and encapsulates information about the fraction of amino acids as well as their local order. It is calculated using the following equation:
$$\text{Fraction of dipeptide}(i) = \frac{\text{total number of dipeptide}(i)}{\text{total number of all dipeptides in the peptide}}$$
where dipeptide(i) is one of the 400 possible dipeptides.
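The dipeptide feature can be sketched in the same way. Two assumptions are made here that the text does not spell out: dipeptides are counted as overlapping adjacent pairs, so a peptide of N residues contains N - 1 of them, and the 400 dipeptides are ordered alphabetically.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 entries


def dipeptide_composition(peptide):
    """Return the 400-dimensional dipeptide fraction vector of a peptide."""
    peptide = peptide.upper()
    total = len(peptide) - 1  # number of overlapping adjacent pairs
    if total < 1:
        raise ValueError("peptide must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = peptide[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]
```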
Binary profile of patterns
Binary profiles were generated for each peptide, with each amino acid represented by a vector of 20 dimensions (e.g. Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), as described in the supporting information (Additional file 1: Figure S1). A pattern of window length W was thus represented by a vector of dimensions 20 × W. We created binary profiles for the first 5 and 10 residues from the N-terminus, and similarly for the last 5 and 10 residues from the C-terminus, of the peptides in all datasets. The binary profile has been used in a number of existing methods [28, 29].
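The encoding can be sketched as a one-hot expansion of a terminal window. The residue ordering (alphabetical by one-letter code, which puts Ala first as in the example above) and the decision to raise an error for peptides shorter than the window are assumptions; the text does not say how short peptides were handled.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_profile(peptide, n_residues=10, terminus="N"):
    """One-hot encode the first (N) or last (C) n_residues of a peptide.

    Returns a flat list of length 20 * n_residues.
    """
    peptide = peptide.upper()
    if terminus not in ("N", "C"):
        raise ValueError("terminus must be 'N' or 'C'")
    if len(peptide) < n_residues:
        raise ValueError("peptide is shorter than the requested window")
    window = peptide[:n_residues] if terminus == "N" else peptide[-n_residues:]
    vector = []
    for residue in window:
        one_hot = [0] * 20
        one_hot[AMINO_ACIDS.index(residue)] = 1  # ValueError on non-standard residues
        vector.extend(one_hot)
    return vector
```

An N10-C10 style input could then be formed as `binary_profile(p, 10, "N") + binary_profile(p, 10, "C")`, giving 400 features.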
Physicochemical properties
Physicochemical properties such as amphipathicity, hydrophobicity, charge and length have previously been shown to be useful in the prediction of CPPs [20, 22]. We calculated a set of these properties (amphipathicity, hydrophobicity, charge, molecular weight, length, isoelectric point, side chain bulk, steric bulk, net donated hydrogen bonds, and number of polar and non-polar residues) to develop prediction models for CPPs. The numerical values of these physicochemical properties were taken from the latest version of the AAindex database [30].
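To give a feel for features of this kind, the sketch below computes three simple descriptors. It is not the feature set used in the study: the Kyte-Doolittle hydropathy scale stands in for whichever AAindex entries were used, and the net charge is a crude count (Lys/Arg as +1, Asp/Glu as -1) that ignores histidine and the termini.

```python
# Kyte-Doolittle hydropathy values (a stand-in scale, chosen for illustration).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def simple_properties(peptide):
    """Return length, approximate net charge and mean hydropathy of a peptide."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    positive = sum(peptide.count(aa) for aa in "KR")
    negative = sum(peptide.count(aa) for aa in "DE")
    mean_hydropathy = sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)
    return {
        "length": len(peptide),
        "net_charge": positive - negative,
        "mean_hydropathy": mean_hydropathy,
    }
```

A polybasic sequence such as `"YGRKKRRQRRR"` comes out with a strongly positive approximate charge and a negative mean hydropathy under these assumptions.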
Sequence logos
The sequence logos were generated using the online WebLogo software [31]. A sequence logo gives the position-specific frequency of amino acids in peptides. Each logo consists of stacks of symbols, one stack for each position in the sequence. The overall height of a stack indicates the sequence conservation at that position, while the height of each symbol within the stack indicates the relative frequency of the corresponding amino acid at that position.
MEME/MAST motifs
We observed various common patterns/motifs in CPPs. To identify these motifs, we used the MEME/MAST program [32], version meme-4.7.0. MEME was used to discover a number of motifs in CPPs, and MAST was then used to scan peptides for the presence of these CPP-specific motifs. The hits in the MAST output were used to calculate the efficacy and coverage of the MEME/MAST method. Because the E-value is crucial in the MAST output, we calculated the efficacy of this method at different E-values (from 10 to 10^-7).
Hybrid approach
In the hybrid approach, we combined the SVM output with the motif information obtained by MEME/MAST for better and biologically more reliable prediction of CPPs. For a query peptide, the SVM model is applied first and generates an SVM score. In parallel, the query peptide is searched against the CPP motifs. If any motif is found in the peptide, its SVM score is increased by 5, so that the peptide is predicted as positive regardless of the original SVM prediction.
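The scoring rule described above can be sketched as follows. The bonus of 5 comes from the text; the decision threshold of 0.0 and the function names are assumptions, and the motif search itself (done with MAST in the study) is represented only by a boolean flag.

```python
def hybrid_score(svm_score, has_cpp_motif, motif_bonus=5.0):
    """Add a fixed bonus to the SVM score when a CPP motif was found."""
    return svm_score + motif_bonus if has_cpp_motif else svm_score


def predict_cpp(svm_score, has_cpp_motif, threshold=0.0):
    """Return True when the hybrid score exceeds the (assumed) threshold."""
    return hybrid_score(svm_score, has_cpp_motif) > threshold
```

With this rule, a peptide scored at -1.5 by the SVM is predicted as a CPP when it carries a motif and as a non-CPP otherwise.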
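The performance of the various models developed in this study was computed using threshold-dependent as well as threshold-independent parameters. As threshold-dependent parameters, we used sensitivity (Sn), specificity (Sp), overall accuracy (Ac) and Matthews correlation coefficient (MCC), calculated using the following equations:
$$\mathrm{Sn} = \frac{TP}{TP + FN} \times 100$$
$$\mathrm{Sp} = \frac{TN}{TN + FP} \times 100$$
$$\mathrm{Ac} = \frac{TP + TN}{TP + FP + TN + FN} \times 100$$
$$\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP and TN are correctly predicted positive and negative examples, respectively, and FP and FN are wrongly predicted positive and negative examples, respectively.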
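These four threshold-dependent measures follow the standard definitions and can be computed from the confusion-matrix counts as sketched below. Reporting Sn, Sp and Ac as percentages and returning an MCC of 0.0 when its denominator is zero are conventions assumed for this example.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return Sn, Sp and Ac (in percent) and MCC from confusion-matrix counts."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report MCC as 0.0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Sn": sensitivity, "Sp": specificity, "Ac": accuracy, "MCC": mcc}
```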
We created ROC (Receiver Operating Characteristic) plots for all models in order to evaluate their performance using threshold-independent parameters. ROC plots with area under the curve (AUC) were created using the ROCR statistical package available in R [33].
Discussion
Because of the wide therapeutic applications of CPPs, especially in drug delivery, the identification of novel and highly efficient CPPs is an urgent need. However, identifying highly efficient CPPs is a tedious task for biologists: the whole protein has to be scanned in overlapping windows, and every resulting peptide has to be tested for cell penetrating activity, which is laborious and time consuming. A computational method that can determine whether a peptide sequence is likely to be a CPP would help biologists screen candidates rapidly before synthesis and thus accelerate CPP-based research. Developing an in silico method for CPP prediction is challenging for three major reasons: (i) CPPs vary considerably in size (5–30 amino acids), whereas machine learning software needs fixed-length patterns as input to develop a model; (ii) experimentally proven non-CPPs (negative dataset), which are very important for developing an in silico method, are rarely reported in the literature; and (iii) there is no dataset of peptides (CPPs and non-CPPs) tested under similar experimental conditions (e.g. concentration, incubation time, cell line, type of cargo, etc.). In most CPP-based research, peptide uptake has been tested on different cell lines under different experimental conditions. It is possible that some non-penetrating analogues of previously known CPPs may act as CPPs when evaluated on other cell lines or under different experimental conditions. Sanders et al. made a similar observation: a previously known non-CPP was found to have some penetrating properties when tested on a different cell line (i.e. an avian cell line) [22]. Therefore, better and more accurate prediction requires larger datasets of CPPs and non-CPPs tested in a number of cell lines under similar experimental conditions. A few attempts have been made in the past to predict CPPs [19-22], but all of these methods used very small datasets and none provided a web service. In the last decade, a large amount of data on the use of CPPs as delivery agents has accumulated, and this growth motivated us to develop an in silico method on a larger dataset of 708 experimentally validated CPPs. To develop a robust computational method that can discriminate CPPs from non-CPPs with higher accuracy, we developed SVM models on three datasets (CPPsite-1, CPPsite-2 and CPPsite-3) using features such as amino acid composition, dipeptide composition, binary profile of patterns and CPP motifs.
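The protein-scanning step mentioned in this paragraph, cutting a protein into overlapping windows that are then screened one by one, can be sketched as follows. The window length of 10 and step of 1 are arbitrary illustrative defaults, not values taken from the study.

```python
def overlapping_windows(protein, window=10, step=1):
    """Yield (start, fragment) pairs for every window along a protein sequence."""
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive")
    for start in range(0, len(protein) - window + 1, step):
        yield start, protein[start:start + window]
```

Each fragment could then be passed to a trained predictor, so that only the highest-scoring candidates need to be synthesized and tested.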
The SVM models developed on the CPPsite-1 and CPPsite-2 datasets performed significantly better than the models developed on the CPPsite-3 dataset. This is because, in CPPsite-3, both positive and negative examples are CPPs; the only difference is that the positive examples are CPPs with high uptake efficiency, while the negative examples are CPPs with low uptake efficiency. Since the peptides in both classes are CPPs and have similar properties, including amino acid composition (Additional file 1: Figure S2), they are difficult to discriminate.
SVM models using amino acid and dipeptide composition as input features performed reasonably well and achieved broadly similar accuracy. Recently, Sanders et al. (2011) published a method in which they used amino acid composition and 41 other biochemical properties, including amino acid frequency, length and hydrophobicity, as input features to develop an SVM model. We have shown that amino acid composition alone can predict CPPs with better accuracy (Table 7). The dipeptide-based model achieved greater accuracy (98%) on the Sanders-2011a dataset, whereas the increase in accuracy (95.94% to 96.40%) of the whole amino acid composition-based model on the same dataset is negligible and could be due to the random sampling of negative examples. One limitation of composition-based models is that they only compute the overall number of residues in a peptide and lose the amino acid order information, which is equally important. It is well known that a peptide's function is strongly related to its sequence order. Evidence suggests that the conformation of CPPs plays a crucial role in membrane interaction and insertion [38], and it has been shown that CPPs with a helical conformation can penetrate membranes more effectively than peptides with other conformations [38]. Many amphipathic CPPs adopt a helical conformation in which the polar residues are grouped on one face of the helix and the nonpolar residues on the opposite face. This amphipathic helical distribution can also be associated with specific amino acids in a particular order. In addition, preliminary analysis (Figures 2 and 3) has shown that certain residues are preferred at specific positions in CPPs. To include this information, we developed SVM models based on binary profile of patterns, which incorporates information about both composition and amino acid order. In many previous studies, SVM models based on binary profile of patterns performed better than composition-based models [25, 26]. In this study also, the N10-C10 binary profile-based SVM model achieved the maximum accuracy (93.51%) on the CPPsite-2 dataset.
In addition, we developed a motif-based method using MEME/MAST, where MEME is used to discover motifs and MAST is used to search for these motifs in CPPs. We conducted this part of the study on the premise that CPPs might share some patterns/motifs. This approach has been used successfully in the past to differentiate two different classes of peptides [37]. In the present study also, the model developed with the motif-based approach predicted CPPs with reasonable accuracy. Finally, to improve the performance of the model, a hybrid model using both binary profile of patterns and motif information was developed, and the motif information further increased the accuracy of CPP prediction. We also compared our method with existing methods on the benchmark datasets, where it performed better than the existing methods. Furthermore, to help biologists, we have implemented our best models in a user-friendly web server, CellPPD.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AG collected the data and created the datasets. KC, RK and AS developed computer programs, implemented SVM. KC, RK and AS created the back end server. KC, RK, AG, PK and AT developed the front end user interface. AG and RK wrote the manuscript. GPSR conceived and coordinated the project, helped in the interpretation of data, refined the drafted manuscript and gave overall supervision to the project. All of the authors read and approved the final manuscript.