Elsevier

Vaccine

Volume 22, Issues 23–24, 13 August 2004, Pages 3195-3204

Prediction of CTL epitopes using QM, SVM and ANN techniques

https://doi.org/10.1016/j.vaccine.2004.02.005

Abstract

Cytotoxic T lymphocyte (CTL) epitopes are potential candidates for subunit vaccine design for various diseases. Most existing T cell epitope prediction methods are indirect: they predict MHC class I binders rather than CTL epitopes. In this study, a systematic attempt has been made to develop a direct method for predicting CTL epitopes from an antigenic sequence. The method is based on a quantitative matrix (QM) and on machine learning techniques, namely Support Vector Machine (SVM) and Artificial Neural Network (ANN). It has been trained and tested on a non-redundant dataset of T cell epitopes and non-epitopes that includes 1137 experimentally proven MHC class I restricted T cell epitopes. The accuracy of the QM-, ANN- and SVM-based methods was 70.0, 72.2 and 75.2%, respectively. Performance was evaluated through Leave One Out Cross-Validation (LOOCV) at a cutoff score where sensitivity and specificity were nearly equal. Finally, both machine learning methods were used for consensus and combined prediction of CTL epitopes. The methods were also evaluated on a blind dataset, where the machine learning-based methods performed better than the QM-based method. We also demonstrated through subgroup analysis that our methods can discriminate between T cell epitopes and MHC binders (non-epitopes). In brief, this method allows prediction of CTL epitopes using QM, SVM and ANN approaches. It also facilitates prediction of MHC restriction in predicted T cell epitopes. The method is available at http://www.imtech.res.in/raghava/ctlpred/.

Introduction

T cells are a vital component of the machinery of protective immunity, acting both directly, by recognizing and eliminating self-altered cells, and indirectly, by controlling the production of antibodies by cells of the B lineage [1]. The former function is controlled by cytotoxic T lymphocytes (CTL) [2]. CTLs recognize proteolysed fragments of a protein in combination with MHC class I molecules [3], [4]. They recognize short peptides of 8–10 amino acids. The interaction of the T cell receptor (TCR) with the MHC-peptide complex can be highly flexible, so that a single TCR can recognize a large number of peptides in the context of a single MHC molecule [5]. Hence, identification of CTL epitopes is crucial for understanding the rules of T cell activation and for designing synthetic vaccines [6]. The identification of CTL epitopes has paved the way towards immunotherapy for cancer and for many infectious diseases.

In the past, a number of methods have been developed for predicting T cell epitopes from protein sequences. These methods can be classified as direct or indirect. In the 1980s, direct prediction methods based on structural and sequential analysis of T cell epitopes were developed [7], [8], [9], [10]. DeLisi and Berzofsky [7] proposed that the critical requirement for a T cell epitope is its ability to form a stable amphipathic structure. Based on this hypothesis, the program AMPHI was developed [8], [9]. Another algorithm, SOHHA, was developed on the assumption that T cell epitopes consist of a helix of 3–5 helical turns with a narrow strip of hydrophobic residues on one side. These approaches were superseded after X-ray crystallographic analysis of MHC-peptide complexes demonstrated that peptides bound in the MHC groove adopt an extended conformation [12].

Sequential models for T cell epitope prediction were also developed; these rely on the occurrence of motifs in the primary sequence rather than on secondary structure [13], [14], [15]. In 1988, Rothbard and Taylor collected nearly 57 T cell epitopes and, based on the patterns they observed, published a list of motifs [14]. The proposed motifs are 3–4 residues long and consist of glycine followed by hydrophobic residues. Further, an algorithm was developed based on the association of cysteine-containing T cell epitopes with certain other residues. The algorithm searches the peptide sequence for triplets including CAK, CLV, CKL and CGS [13]. In 1995, two computational T cell epitope prediction tools, EpiMer and OptiMer, were developed based on knowledge of MHC binding motifs [11]. OptiMer predicts amphipathic segments of a protein with high motif density, and EpiMer locates segments of a protein with high motif density. These direct prediction methods, based on structural or sequential models, have low accuracy [16]. The main causes of the low accuracy may be insufficient data and the lower specificity of T cell receptors (TCRs).
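The triplet search mentioned above can be illustrated with a short sketch. It assumes only the four triplets named in the text; the published algorithm [13] may use a longer list and additional rules, and the function name and example sequence are hypothetical.

```python
# Illustrative sketch: only the four triplets named in the text are used.
# The published algorithm may include more triplets and further rules.
TRIPLETS = ("CAK", "CLV", "CKL", "CGS")


def find_triplets(sequence, triplets=TRIPLETS):
    """Return (start, triplet) pairs for every match, with 0-based starts."""
    sequence = sequence.upper()
    hits = []
    # Slide a 3-residue window along the sequence.
    for start in range(len(sequence) - 2):
        window = sequence[start:start + 3]
        if window in triplets:
            hits.append((start, window))
    return hits


# Hypothetical input sequence, chosen only to contain two of the triplets.
print(find_triplets("MKCAKTTCLVGS"))  # [(2, 'CAK'), (7, 'CLV')]
```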

In the last decade, a number of indirect methods have been developed that predict MHC binders instead of T cell epitopes. The currently available indirect methods are based on structure, binding motifs, matrices or Artificial Neural Networks (ANNs) [17], [18], [19], [20], [21], [22], [23], [24]. Because the interaction between MHC and peptide is more specific, these methods perform better than direct T cell epitope prediction methods. Their major limitation is that they cannot discriminate between T cell epitopes and non-epitope MHC binders; they only predict MHC binders from antigenic sequences.

In this study, an attempt has been made to develop a direct method for prediction of CTL epitopes. The data on CTL epitopes and non-epitopes were obtained from MHCBN version 1.1, a comprehensive database of MHC binders and non-binders [25]. Methods based on QM, SVM and ANN have been developed to discriminate between CTL epitopes and non-epitopes.

The methods based on QM, ANN and SVM achieved accuracies of 70.0, 72.2 and 75.5%, respectively, when evaluated through Leave One Out Cross-Validation (LOOCV). The results illustrate that machine learning techniques perform better than quantitative matrices. The performance of the machine learning techniques was further enhanced by devising consensus and combined approaches based on SVM and ANN. The combined prediction approach achieved a sensitivity of 79.4%, higher than that of any individual method. The consensus approach achieved a specificity of 88.4%, higher than that of any individual method.
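The trade-off between the two approaches can be sketched as follows. This is a minimal illustration under an assumption: "combined" is taken to mean that a peptide is called an epitope if either SVM or ANN predicts it, and "consensus" that both must agree. These definitions are not spelled out in this section, but they match the reported pattern of higher sensitivity for the combined approach and higher specificity for the consensus approach. The labels and predictions are toy values, not data from the study.

```python
def combined_call(svm_epitope, ann_epitope):
    """Assumed 'combined' rule: epitope if either classifier says so."""
    return svm_epitope or ann_epitope


def consensus_call(svm_epitope, ann_epitope):
    """Assumed 'consensus' rule: epitope only if both classifiers agree."""
    return svm_epitope and ann_epitope


def sensitivity_specificity(predictions, labels):
    """Return (sensitivity, specificity) for boolean predictions and labels."""
    tp = sum(p and y for p, y in zip(predictions, labels))
    fn = sum((not p) and y for p, y in zip(predictions, labels))
    tn = sum((not p) and (not y) for p, y in zip(predictions, labels))
    fp = sum(p and (not y) for p, y in zip(predictions, labels))
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: True = epitope. Four epitopes followed by four non-epitopes.
labels = [True, True, True, True, False, False, False, False]
svm = [True, True, False, False, True, False, False, False]
ann = [True, False, True, False, False, True, False, False]

combined = [combined_call(s, a) for s, a in zip(svm, ann)]
consensus = [consensus_call(s, a) for s, a in zip(svm, ann)]

print("combined :", sensitivity_specificity(combined, labels))
print("consensus:", sensitivity_specificity(consensus, labels))
```

On this toy input the OR rule recovers more true epitopes at the cost of more false positives, while the AND rule does the opposite.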

The methods developed in this study were also evaluated on a blind dataset that does not contain any pattern used in training or testing. Their performance was evaluated on two subgroups: (i) CTL epitopes and non-epitope MHC binders, and (ii) CTL epitopes and MHC non-binders. The performance of all methods was fairly good on both subgroups, as shown in Table 6. This demonstrates that the methods developed in this study can discriminate between CTL epitopes and non-epitope MHC binders, which is not possible with MHC binder prediction methods.

Finally, the MHC restriction of predicted CTL epitopes was examined using a quantitative matrix-based MHC binder prediction method [23], which determines the MHC binding specificity of T cell epitopes. A schematic view of the prediction method is shown in Fig. 1. In summary, this comprehensive method will speed up the process of vaccine development for diseases such as cancer and AIDS.

Datasets

All peptide sequences of the CTL epitopes and non-epitopes were drawn from MHCBN version 1.1 [25]. Initially, 1334 CTL epitopes of 9 amino acids with varying T cell activity were obtained from the database. All duplicate epitopes and epitopes having unnatural amino acids were removed. The final dataset consisted of 1137 CTL epitopes interacting with nearly 170 MHC class I molecules. A total of 340 CTL non-epitopes of 9 or more amino acids were extracted from MHCBN. They were chopped to obtain
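The cleaning steps described above (removing duplicates and peptides with unnatural amino acids, keeping 9-residue peptides) could look roughly like the sketch below. How the longer non-epitopes were chopped is not specified here, so the overlapping 9-residue windows in `overlapping_windows` are an assumption. All function names and example strings are illustrative and not taken from the study's dataset.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_peptides(peptides, length=9):
    """Keep unique peptides of the given length made of standard residues."""
    seen = set()
    kept = []
    for pep in peptides:
        pep = pep.strip().upper()
        if len(pep) != length:
            continue  # wrong length
        if not set(pep) <= STANDARD_AA:
            continue  # contains a non-standard residue symbol
        if pep in seen:
            continue  # duplicate
        seen.add(pep)
        kept.append(pep)
    return kept


def overlapping_windows(peptide, length=9):
    """ASSUMPTION: chop a longer peptide into all overlapping windows."""
    return [peptide[i:i + length] for i in range(len(peptide) - length + 1)]


raw = ["GILGFVFTL", "gilgfvftl", "GILGFVFTX", "SIINFEKL", "NLVPMVATV"]
print(clean_peptides(raw))
print(overlapping_windows("ACDEFGHIKLM"))
```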

Quantitative matrices

For the QM, the contribution of each residue at each position of the peptide to T cell activity was quantified. A matrix of weights for each amino acid residue at every peptide position was generated using Eq. (1). The QM is shown in Table 1. From it, the effect of each residue on the T cell activity of a peptide can be easily estimated. The QM-based method classified the data with 70.0% accuracy at the default threshold, where the sensitivity and specificity of prediction were nearly equal. The
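A position-specific weight matrix of this kind can be sketched as follows. Eq. (1) is not reproduced in this section, so the weight used here (residue frequency in epitopes minus residue frequency in non-epitopes at each position) is an assumed stand-in, not the authors' formula. The peptide score is the sum of the weights, and the cutoff is picked where sensitivity and specificity are closest. The peptides are 3-residue toy strings for brevity.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequencies(peptides, length):
    """Per-position residue frequencies for equal-length peptides."""
    freqs = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        freqs.append({aa: counts[aa] / len(peptides) for aa in AMINO_ACIDS})
    return freqs


def build_matrix(epitopes, non_epitopes, length):
    """ASSUMED weight: epitope frequency minus non-epitope frequency."""
    pos_f = position_frequencies(epitopes, length)
    neg_f = position_frequencies(non_epitopes, length)
    return [{aa: pos_f[i][aa] - neg_f[i][aa] for aa in AMINO_ACIDS}
            for i in range(length)]


def score(peptide, matrix):
    """Sum the weight of each residue at its position."""
    return sum(matrix[i][aa] for i, aa in enumerate(peptide))


def balanced_threshold(pos_scores, neg_scores):
    """Return (cutoff, sensitivity, specificity) with the smallest gap."""
    best = None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        sens = sum(s >= t for s in pos_scores) / len(pos_scores)
        spec = sum(s < t for s in neg_scores) / len(neg_scores)
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, t, sens, spec)
    return best[1], best[2], best[3]


# Toy 3-residue peptides, not real data.
epitopes = ["ALV", "ALI", "GLV"]
non_epitopes = ["KDE", "KDV", "ADE"]
qm = build_matrix(epitopes, non_epitopes, length=3)

# Scoring the training peptides themselves is for illustration only;
# LOOCV would rebuild the matrix with each test peptide held out.
cutoff, sens, spec = balanced_threshold(
    [score(p, qm) for p in epitopes],
    [score(p, qm) for p in non_epitopes],
)
print(cutoff, sens, spec)
```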

Discussion and conclusions

It was observed in the mid-1990s that the performance of all previously published T cell epitope prediction methods was quite poor [16]. The performance of these methods was not even significantly better than random prediction. The lack of a sufficient amount of data on T cell epitopes may be the prime cause of this poor performance [16]. The success of a prediction method depends on the quality and quantity of data. To predict T cell epitopes with fair accuracy, a large number of MHC binders

Acknowledgements

The authors are thankful to Sanjoy Paul and Amrita Lama for carefully reading the manuscript. The authors are also thankful to the Council of Scientific and Industrial Research (CSIR) and the Department of Biotechnology (DBT), Govt. of India, for financial assistance. Manoj Bhasin is a recipient of a fellowship from CSIR. This report has IMTECH communication No. 016/2003.

References (34)

  • C. Watts et al. Pathways of antigen processing and presentation. Rev. Immunogenet. (1999)
  • S. Brunak et al. Identifying cytotoxic T cell epitopes from genomic and proteomic information: “The human MHC project”. Rev. Immunogenet. (2000)
  • C. DeLisi et al. T-cell antigenic sites tend to be amphipathic structures. Proc. Natl. Acad. Sci. U.S.A. (1985)
  • Cornette JL, Margalit H, DeLisi C, Berzofsky JA. The amphipathic helix as a structural feature involved in T cell...
  • J.L. Spouge et al. Strong conformational propensities enhance T cell antigenicity. J. Immunol. (1987)
  • L.J. Stern et al. Crystal structure of the human class II MHC protein HLA-DR1 complexed with an influenza virus peptide. Nature (1994)
  • S. Mouritsen et al. T-helper-cell determinants in protein antigens are preferentially located in cysteine-rich antigen segments resistant to proteolytic cleavage by cathepsin B, L, and D. Scand. J. Immunol. (1991)

    Supplementary data associated with this article can be found at doi: 10.1016/j.vaccine.2004.02.005.
