Background
The term T cell repertoire describes a collection of lymphocytes characterized by T cell receptor (TCR) expression, which plays a critical role in antigen recognition. Because alterations of the T cell repertoire provide a significant indication of immune status in physiological and disease conditions, T cell repertoire analyses have been conducted to identify antigen-specific T cells involved in the development of disease and to diagnose T lymphocyte abnormalities. Comparisons of variable-region usage by fluorescence-activated cell sorter analysis using a large panel of antibodies specific for TCR variable regions [1-4], polymerase chain reaction (PCR) with multiple primers [5] or PCR-based enzyme-linked immunosorbent assay [6, 7] have been widely used to detect changes in the T cell repertoire. Length distribution analysis, known as CDR3 spectratyping, is based on the addition of non-template nucleotides in the V-(D)-J region and has been used to evaluate T cell clonality and diversity [8, 9]. To further identify the antigen specificity of T cells, PCR cloning of TCR clonotypes and subsequent sequence determination of the antigen recognition region, CDR3, have been required. These conventional approaches are commonly used but are time-consuming and laborious ways to study TCR repertoires.
In recent years, high-throughput sequencing technologies known as next-generation sequencing (NGS) have progressed rapidly and enabled large-scale analysis of sequence data [
10,
11]. Although several NGS-based TCR repertoire analysis systems have been developed by other researchers, many amplification techniques are based on multiplex PCR with different primers specific for each variable region. Bias during PCR amplification is therefore unavoidable, most commonly because of differential hybridization kinetics among variable region-specific primers for different target genes. Correction and additional computational normalization methods are required to minimize PCR bias when using multiplex PCR assays [
12]. The use of a single set of primers is a better way to achieve unbiased and quantitative amplification of all TCR genes including unknown variants where the 5’ ends of sequences are highly diverse. A single strand oligonucleotide anchor ligation to the 3’ end of cDNA with T4 RNA ligase [
13], homopolymeric tailing of cDNA, 5’ rapid amplification of cDNA ends (RACE) [
14] and template switching PCR (TS-PCR or SMART PCR) [
15] have been used to analyze TCR repertoires [
16,
17]. TS-PCR is simple and convenient but produces high levels of background amplification because TS primers non-specifically anneal to random regions in RNA or allow the repeated addition of TS primers [
18,
19]. Thus, the current study describes an adaptor-ligation mediated PCR (AL-PCR) developed by the addition of an adaptor to the 5’ end of double stranded (ds) cDNA from TCR transcripts and subsequent PCR amplification with the adaptor primer and constant region-specific primer, as first reported by Tsuruta et al. [
20,
21]. The adaptor ligation to blunt-ended ds cDNA is less influenced by the sequence of a particular cDNA while the efficiency of 5’ adaptor ligation with T4 RNA ligase is sequence dependent [
22]. In addition, the ligation of dsDNA by T4 ligase is more efficient than ssDNA ligation with T4 RNA ligase in ligation anchored PCR (LA-PCR).
Various sequencing technologies such as Roche 454 (San Francisco, CA), Illumina (San Diego, CA), Ion-Torrent (Life Technologies, Grand Island, NY), SOLiD (Life Technologies), Helicos (Cambridge, MA) and PacBio (Menlo Park, CA) have been developed. Among these NGS platforms, 454 DNA sequencing produces reads ranging from 50 to 600 base pairs (bp) or more in length with sufficient read output, although with fewer reads per run than Illumina. Long-read sequencing allows determination of the full or near-complete length of TCR genes including V, D, J and C regions. Furthermore, recombinant TCR proteins can be easily produced by subsequent PCR cloning of the TCR genes. Therefore, we applied an adaptor-ligation mediated PCR method to NGS with 454 DNA sequencing.
Natural killer T (NKT) cells are a distinct T cell population with an important role in innate and adaptive immunity. NKT cells regulate a broad range of immune responses such as autoimmune diseases, tumor surveillance, and host defense against pathogenic infections. NKT cells express an invariant TCRα consisting of Vα24 and Jα18 that recognizes glycolipids presented by a non-classical major histocompatibility complex class I-related protein, CD1d [
23]. Recently, mucosal-associated invariant T (MAIT) cells, which preferentially exist in mucosal tissues, were shown to be a unique T cell population expressing a semi-invariant TCRα consisting of Vα7.2 and Jα33. MAIT cells recognize microbial vitamin B metabolites presented by a non-classical MHC class I molecule, MHC-related protein 1 (MR1) [
24]. These T cell populations bearing invariant TCRα play a pivotal role in immune regulation but it remains to be determined whether all invariant TCRα are expressed by these unique T cell populations.
In this study, we sequenced TCR transcripts from 20 healthy individuals using a newly developed NGS-based TCR repertoire analysis. Initially, based on sequence read counts, we examined the usage of variable and joining regions, and further analyzed clonality and diversity in TCRα and β genes. Unique sequence reads identified using an originally developed gene analysis program were compared at the clonal level among healthy individuals. The results showed similar usage of TRV and TRJ and a similar extent of T cell diversity among individuals. Interestingly, TCRβ reads were less frequently shared among individuals, whereas TCRα reads frequently contained shared sequences that overlapped between two or more individuals. Shared TCRα reads contained a high proportion of invariant TCRα, indicating the presence of iNKT cells or MAIT cells.
In this report, we demonstrated that analysis of TCR genes shared among multiple individuals from NGS data provided significant information on invariant TCRs expressed by NKT cells and MAIT cells.
Discussion
High-throughput sequencing technologies have taken a great leap forward with the development of a wide variety of NGS platforms. NGS facilitates the acquisition of an enormous amount of sequence data but still requires PCR amplification or gene enrichment to sequence genes of specific interest instead of the entire genome or gene library. For heterogeneous TCR or BCR genes generated by rearrangement of many gene segments, multiplex PCR with many sets of gene-specific primers has been widely used. However, the use of multiple primers causes amplification bias between respective genes, hampering the accurate estimation of gene frequency. Here, we used an unbiased PCR technique, adaptor-ligation mediated PCR, for NGS-based TCR repertoire analysis. The method uses a single set of primers to avoid PCR bias caused by competition between primers. It is therefore better suited to estimating the abundances of respective TCR genes accurately across a wide variety of samples.
We comprehensively examined TRA and TRB repertoires at the clonal level from a large number of individuals (n = 20) and evaluated a large amount of sequence data (a total of 149,216 unique sequence reads from 267,037 sequence reads). Thus, this study precisely revealed the normal range of gene usage as well as the extent of diversity and similarity of TCR repertoires in healthy individuals. Compared with the Illumina NGS platform [
16,
17,
33], the sequence reads per sample were less numerous but were longer and of higher quality. With the Illumina platform, differences in sequencing depth among CDR3 contigs assembled from many shotgun reads may make it difficult to determine the frequency of TCR clonotypes. In this study, however, each TCR sequence was determined from a single read that was long enough to cover the entire CDR3, V and J regions (mean ~400 bp, Additional file 1: Tables S1 and S2). Direct analysis of read sequences without assembly is likely to reflect the actual frequencies of TCR clonotypes accurately. Error rates in TCR sequences were slightly lower than that in a previous report, which showed a mean error rate of 1.07 % for 454 sequences [27], suggesting high levels of accuracy and quality irrespective of nested PCR. Homopolymeric stretches within coding regions are a typical source of errors in 454 sequencing and cause frame shifts in the coding sequence. This leads to a higher rate of out-of-frame reads with 454 sequencing than with other sequencing platforms. Bolotin et al. previously reported a higher percentage of mismatch-containing sequences in Illumina than in Roche 454 and Ion Torrent datasets (3.2, 1.4 and 1.2 %) [34]. The error rate obtained in this study appears to be lower than that in the previous report, even though our data showed a higher error rate in out-of-frame than in in-frame reads. This is consistent with error rates lower than those of Illumina, although our results do not provide direct evidence. Furthermore, the assignment and aggregation software, RG, can rapidly summarize the usage of TRV and TRJ as well as their recombination usage. This integrated analysis readily allows the detection of preferential usage of a given TRV and/or TRJ and will therefore be useful for studying immune responses by antigen-specific T cells.
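The kind of usage summary described above can be illustrated with a minimal sketch. It is not the RG software; it assumes each read has already been assigned a TRV and a TRJ segment, and the record layout, function names and example reads are hypothetical.

```python
from collections import Counter


def usage_percent(reads, key):
    """Return the percentage of reads assigned to each segment.

    `reads` is a list of dicts with keys "v" and "j" (an assumed layout);
    `key` selects which segment to summarize.
    """
    counts = Counter(read[key] for read in reads)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to summarize")
    return {segment: 100.0 * n / total for segment, n in counts.items()}


def recombination_usage(reads):
    """Count how often each (TRV, TRJ) combination occurs."""
    return Counter((read["v"], read["j"]) for read in reads)


# Hypothetical assignments, for illustration only.
example_reads = [
    {"v": "TRAV1-2", "j": "TRAJ33"},
    {"v": "TRAV1-2", "j": "TRAJ33"},
    {"v": "TRAV10", "j": "TRAJ18"},
    {"v": "TRAV1-2", "j": "TRAJ12"},
]

if __name__ == "__main__":
    print(usage_percent(example_reads, "v"))
    print(recombination_usage(example_reads).most_common(1))
```

Counting directly from single reads mirrors the assembly-free approach described in the text: each read contributes exactly one V, one J and one V-J combination.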
Unlike widely used multiplex PCR methods that typically require compensation for PCR bias [12], the AL-PCR method is expected to estimate TCR repertoires accurately without such compensation. High expression levels of TRBV18 (BV18S1, Arden’s nomenclature), TRBV19 (BV17S1) and TRBV7-9 (BV6S5) as well as low expression levels of TRBV20-1 (BV2S1), TRBV28 (BV3S1) and TRBV29-1 (BV4S1) were reported in CD4+ and CD8+ cells by multiplex PCR [
35]. However, flow cytometry analysis showed that TRBV20 and TRBV29 were abundantly expressed in PBLs [
1,
36,
37]. To examine the difference in accuracy between AL-PCR and multiplex PCR in detail, we compared TRBV usage obtained with either AL-PCR or multiplex PCR with FACS data reported previously by van den Beemd et al. [1]. The results indicated that the AL-PCR method correlated better with the FACS method than the multiplex PCR method did, suggesting that the AL-PCR method with a set of universal primers enables accurate analysis of TCR repertoires. In addition, our TCR repertoire results are similar to those of a previous report [38]. Therefore, this method should provide direct, accurate and dependable results for TCR repertoires.
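A comparison of this kind can be sketched as a correlation between per-gene usage percentages from two methods. The text does not state which correlation statistic was used, so the Pearson coefficient below is an assumption, and all usage values are invented placeholders rather than data from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder usage percentages for four TRBV genes (not real measurements).
facs_usage = [4.0, 9.5, 3.0, 6.5]
method_a_usage = [4.5, 9.0, 3.5, 6.0]   # hypothetical method tracking FACS
method_b_usage = [8.0, 3.0, 6.0, 4.0]   # hypothetical method with bias

if __name__ == "__main__":
    print(round(pearson(facs_usage, method_a_usage), 3))
    print(round(pearson(facs_usage, method_b_usage), 3))
```

A rank-based coefficient such as Spearman's would be a reasonable alternative when usage values are strongly skewed.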
Comparison with a large number of healthy individuals has revealed that patients with X-linked agammaglobulinemia [39] or common variable immune deficiency (CVID) [40] have skewed and contracted TCR repertoires. It is important to define the TCR repertoires of healthy individuals when considering how much patients differ from normal. Regarding TRV and TRJ repertoires, we observed preferential usage of TRV and TRJ in peripheral blood from healthy individuals. Similar usage between in-frame (productive) and out-of-frame (unproductive) reads suggests that the preferential usage is unlikely to be due to peripheral selection. Given that preferential usage has been observed in immature T cells [41], it is likely to be influenced by genetic factors such as the recombination process.
Recombination usage exhibited infrequent recombination of AJ-proximal 3’ AV segments with AV-distal 3’ AJ segments and of AJ-distal 5’ AV segments with AV-proximal 5’ AJ segments. In gene rearrangement of the TCRαδ locus, activation of the TCRα enhancer (Eα) and the T early activation (TEA) promoter initiates primary rearrangement of proximal TRAV and TRAJ segments. Subsequent secondary rearrangement occurs using distal 5’ TRAV and distal 3’ TRAJ genes [
42‐
45], resulting in the restricted usage of TRA repertoires (model of sequential bidirectional recombination) [
46]. However, all TRAV genes can recombine with TRAJ genes in secondary rearrangement by the model of locus contraction and DNA looping formation [
47]. Although there was inefficient recombination of distal-proximal and proximal-distal TRAV-TRAJ genes, TRAJ usage was not restricted across TRAV genes but was rather evenly distributed. This suggests that the frequency of recombination varies depending on the location of TRAV and probably depends on the ability to form loops between TRAV and TRAJ loci.
Potential TCR diversity generated by recombination and nucleotide addition/deletion has been estimated to be up to 10^15 [48]. By NGS-based estimation, TRB diversity was estimated to be 3–4 × 10^6 [33] or approximately 1 × 10^6 in humans [17]. Furthermore, the diversity of TRA is 50 % of that of TRB in humans [49]. In mice, TRA diversity was suggested to be 0.79 × 10^4 [44] or 1.18 × 10^4 [50] and is 10-fold lower than TRB diversity. This lower diversity of TRA might be caused by a difference in recombination processes between TRA and TRB. However, our results showed a similar extent of diversity between TRA and TRB as evaluated by Simpson and Shannon-Weaver indices. Similarly, Wang et al. reported that TCR diversity was estimated to be equal between TRA and TRB (0.47 × 10^6 vs. 0.35 × 10^6) [51, 52]. Contrary to previous reports obtained using a limited number of sequences, large-scale sequencing suggests that the repertoire size for TRA generated by V-J recombination is comparable with that for TRB by V-D-J recombination.
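As an illustration of the two indices named above, the following sketch computes them from clonotype read counts. The exact formulations used in the study are not given in this section, so the common definitions are assumed here: Shannon-Weaver H' = -Σ p_i ln p_i and inverse Simpson 1 / Σ p_i², where p_i is the fraction of reads belonging to clonotype i. Function names and the example counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon-Weaver index H' = -sum(p * ln p) over clonotype frequencies."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one read")
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def inverse_simpson_index(counts):
    """Inverse Simpson index 1 / sum(p^2); equals the number of clonotypes
    when all clonotypes are equally abundant (assumed formulation)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one read")
    return 1.0 / sum((n / total) ** 2 for n in counts)


# Hypothetical read counts per unique clonotype.
even_repertoire = [1, 1, 1, 1]
skewed_repertoire = [97, 1, 1, 1]

if __name__ == "__main__":
    print(shannon_index(even_repertoire), inverse_simpson_index(even_repertoire))
    print(shannon_index(skewed_repertoire), inverse_simpson_index(skewed_repertoire))
```

Both indices fall as a repertoire becomes dominated by a few expanded clones, which is why they are useful for comparing TRA and TRB on the same scale. Because both depend on sequencing depth, comparisons are most meaningful between samples with similar read numbers.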
As for TCR diversity, productive TCRs were more diverse than unproductive ones. Only a portion of T cells produce both productive and unproductive TCRs. This difference might depend on the number of reads obtained from the library. There was also a correlation between diversity and age. This is consistent with a previous report of an age-related decrease in the TCR repertoire [53]. Diverse T cells are generated in the thymus, and thymic involution occurs with age. The decrease in TCR diversity in the periphery is likely due to the age-dependent decrease in thymic T cell regeneration.
Of note, we found that TRA repertoires were considerably similar between individuals. This was mainly due to the presence of shared TCR sequences between two or more individuals. It has been reported that shared TCRβ amino acid sequences have fewer additions in their nucleotide sequences [
54,
55]. Random nucleotide addition and deletion mediated by terminal deoxynucleotidyl transferase occurs during TCR rearrangement, resulting in a remarkable increase in diversity of the CDR3 region. However, the shared TCRs appeared to have germline-like CDR3 sequences that did not undergo such modifications (Table
3). Furthermore, the shared TCRs contained many TCR clonotypes with a shorter CDR3 length. These results suggest that the frequent occurrence of shared TRAs is likely to be caused by a difference in the inherent recombination mechanism from TRB (V-J vs. V-D-J).
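The detection of shared sequences can be sketched as follows. This is a minimal illustration rather than the analysis program used in the study: it assumes a clonotype is identified by a (V segment, J segment, CDR3 amino acid sequence) tuple, and the donor labels and CDR3 strings are placeholders.

```python
from collections import defaultdict


def shared_clonotypes(repertoires, min_individuals=2):
    """Return clonotypes found in at least `min_individuals` individuals.

    `repertoires` maps an individual's label to an iterable of clonotypes.
    The result maps each shared clonotype to the number of carriers.
    """
    carriers = defaultdict(set)
    for individual, clonotypes in repertoires.items():
        # Use a set so that clonal expansion within one individual
        # does not inflate the number of carriers.
        for clonotype in set(clonotypes):
            carriers[clonotype].add(individual)
    return {
        clonotype: len(individuals)
        for clonotype, individuals in carriers.items()
        if len(individuals) >= min_individuals
    }


# Placeholder data: CDR3 strings are labels, not real sequences.
example_repertoires = {
    "donor_1": [("TRAV1-2", "TRAJ33", "CDR3_A"), ("TRAV12-1", "TRAJ9", "CDR3_B")],
    "donor_2": [("TRAV1-2", "TRAJ33", "CDR3_A"), ("TRAV21", "TRAJ40", "CDR3_C")],
    "donor_3": [("TRAV1-2", "TRAJ33", "CDR3_A"), ("TRAV12-1", "TRAJ9", "CDR3_B")],
}

if __name__ == "__main__":
    for clonotype, n in shared_clonotypes(example_repertoires).items():
        print(clonotype, n)
```

Raising `min_individuals` restricts the output to sequences present in many individuals, which is the subset in which invariant TCRα chains would be expected to concentrate.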
It is noteworthy that the shared TRA were present in a large number of individuals. We unexpectedly found that the shared TRA contained a high proportion of TCRα related to invariant TCRα derived from MAIT cells or iNKT cells. These functionally important T cells have homogeneous TCRα and diverse TCRβ. MAIT cells, which are preferentially localized in the gut lamina propria, express a canonical TCRα consisting of TRAV1-2 (Vα7.2)-TRAJ33 (Jα33) [56, 57] as well as TCRα bearing TRAV1-2-TRAJ12 or TRAV1-2-TRAJ20 [58, 59]. MAIT cells recognize vitamin B2 metabolites presented by MR1, a non-classical MHC class I molecule [
24,
57]. Furthermore, CD1d-restricted iNKT cells express an invariant TRAV10 (Vα24)-TRAJ18 (Jα18) chain and semi-invariant TRBV25-1 (Vβ11) [
60] and recognize glycolipids such as α-galactosyl ceramide, self-glycolipid, or isoglobotrihexosyl ceramide [
61]. Both cell types play an essential role in the regulation of immune responses against infections, tumors, autoimmune diseases, and tolerance induction [
23]. Frequencies of MAIT and iNKT cells obtained in this study were consistent with previous reports showing MAIT cells expanded up to 1–4 % of peripheral blood T cells [
62] and that iNKT cells accounted for 0.2 % of total PBMCs [
63]. Interestingly, different types of shared sequences bearing TRAV1-2 (for example, TRAV1-2-TRAJ12 and TRAV1-2-TRAJ20) and several shared TRA sequences other than the well-known MAIT and iNKT sequences exist. Therefore, NGS-based repertoire analysis is useful both for estimating the frequency of MAIT or iNKT cells and for identifying potential new invariant TCRα chains. Further characterization and verification are required to confirm potential novel invariant TCRα chains.
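A rough frequency estimate of this kind can be sketched by labeling TRA reads according to the V-J combinations named in this section. This is a simplification under stated assumptions: a V-J match alone does not prove an invariant CDR3, so real use would also require a CDR3 check, and the function name and example reads are hypothetical.

```python
from collections import Counter

# V-J combinations taken from the text; the labels are a simplification.
INVARIANT_PAIRS = {
    ("TRAV1-2", "TRAJ33"): "MAIT",
    ("TRAV1-2", "TRAJ12"): "MAIT",
    ("TRAV1-2", "TRAJ20"): "MAIT",
    ("TRAV10", "TRAJ18"): "iNKT",
}


def invariant_read_percent(reads):
    """Return the percentage of TRA reads labeled MAIT, iNKT or other.

    `reads` is a list of (TRAV, TRAJ) tuples, one per read (assumed layout).
    """
    if not reads:
        raise ValueError("no reads to classify")
    labels = Counter(INVARIANT_PAIRS.get(pair, "other") for pair in reads)
    return {label: 100.0 * n / len(reads) for label, n in labels.items()}


# Hypothetical reads, for illustration only.
example_tra_reads = (
    [("TRAV1-2", "TRAJ33")] * 2
    + [("TRAV10", "TRAJ18")]
    + [("TRAV12-1", "TRAJ9")] * 7
)

if __name__ == "__main__":
    print(invariant_read_percent(example_tra_reads))
```

Read percentages approximate cell frequencies only if TCRα transcript levels per cell are comparable across populations, which is a further assumption of this sketch.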