To obtain a comprehensive set of human SNP data, we retrieved a recent release of human SNPs from NCBI dbSNP (
http://www.ncbi.nlm.nih.gov/SNP/). Of the “common” and “all” SNP sets, which are labeled separately on the website, we first chose the common SNPs for analysis because of their higher frequency and reliability, and because the total number of all SNPs already amounts to about one tenth of the human genome. In total, 34,082,224 common SNP sites were downloaded. We then filtered these common SNPs (Fig.
1a). We inferred the ancestral variants in the human (
Homo sapiens) genome from the orthologous sites in two other mammalian species, rhesus macaque (monkey,
Macaca mulatta) and mouse (
Mus musculus). Only SNPs whose genomic sequence was identical to that of at least one of the two species, monkey or mouse, were considered ancestral SNPs (Fig.
1a). Next, some SNPs have more than one mutation type, which could cause conflicts in functional annotation. These “multi-mutation” SNPs were therefore discarded, and only “uni-mutation” SNPs were retained (Fig.
1b). After filtering, 21,221,571 SNPs remained, among which C > T and G > A transitions were the most prevalent, whereas transversions were less frequent (Fig.
1c). We annotated the SNPs with SnpEff [
33], choosing the canonical transcript of each gene. Most SNPs were located in intergenic or intronic regions, and the exonic SNPs were distributed among CDSs, UTRs, and noncoding RNAs (Fig.
1d). In total, 17,940 genes had at least one SNP in the CDS (Additional file
2: Table S2). Note that we excluded sites in splicing regions when defining nonsynonymous and synonymous mutations in CDSs. Therefore, any selection detected at nonsynonymous or synonymous sites is unlikely to reflect selection on splicing events.
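
The two filtering steps (Fig. 1a, b) and the substitution tally (Fig. 1c) can be illustrated with a minimal Python sketch. It is not the pipeline used in this study. It assumes that each SNP is a single-nucleotide variant stored as a record with the hypothetical fields `id`, `ref`, `alts`, `macaque` and `mouse`, that the outgroup fields hold the orthologous base (or `None` when no alignment exists), and that "identical to at least one species" means the human reference allele matches the macaque or mouse base, so that the reference allele is treated as ancestral.

```python
from collections import Counter

# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def filter_snps(snps):
    """Return (id, ancestral, derived) for SNPs passing both filters.

    Assumed record layout (hypothetical, not from dbSNP itself):
    {'id': str, 'ref': str, 'alts': list[str],
     'macaque': str | None, 'mouse': str | None}
    """
    kept = []
    for snp in snps:
        # "Uni-mutation" filter: exactly one alternative allele.
        if len(snp["alts"]) != 1:
            continue
        # Ancestral filter (assumed reading): the reference allele must
        # match the orthologous base in at least one outgroup species.
        outgroup_bases = {snp["macaque"], snp["mouse"]} - {None}
        if snp["ref"] not in outgroup_bases:
            continue
        kept.append((snp["id"], snp["ref"], snp["alts"][0]))
    return kept


def substitution_spectrum(kept):
    """Count each ancestral>derived change and split into Ts / Tv."""
    spectrum = Counter(f"{anc}>{der}" for _, anc, der in kept)
    n_transitions = sum(1 for _, anc, der in kept if (anc, der) in TRANSITIONS)
    n_transversions = len(kept) - n_transitions
    return spectrum, n_transitions, n_transversions


# Toy input, invented purely for illustration.
example_snps = [
    {"id": "snp1", "ref": "C", "alts": ["T"], "macaque": "C", "mouse": "C"},
    {"id": "snp2", "ref": "G", "alts": ["A", "T"], "macaque": "G", "mouse": "G"},
    {"id": "snp3", "ref": "A", "alts": ["C"], "macaque": "G", "mouse": None},
    {"id": "snp4", "ref": "G", "alts": ["C"], "macaque": None, "mouse": "G"},
]

kept = filter_snps(example_snps)
spectrum, n_ts, n_tv = substitution_spectrum(kept)
print(kept)
print(dict(spectrum), n_ts, n_tv)
```

In this toy input, `snp2` is dropped as a multi-mutation site and `snp3` is dropped because its reference allele matches neither outgroup. A stricter variant could require agreement between both outgroups, which would reduce the number of retained sites.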