The data
We used the subset of the data from the North American Rheumatoid Arthritis Consortium (NARAC) study provided for Genetic Analysis Workshop 16 [9]. It consisted of 868 cases and 1,194 controls. The whole sample was genotyped with the Illumina 550k chip (545,080 SNPs). Before the analysis, we cleaned the data. For example, we excluded SNPs with a call rate smaller than 0.95 (18,627 SNPs), a minor allele frequency smaller than 0.01 (23,047 SNPs), or a p-value for the Hardy-Weinberg equilibrium test smaller than 1 × 10⁻⁵ (1,342 SNPs). We also excluded six individuals with sex genotype inconsistencies. All of the individuals had a call rate greater than 0.95. All of our analyses were done with 'affection status' as the trait of interest and 'sex' as a covariate.
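As an illustration only, the three SNP filters described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the pipeline used for the analysis: genotypes are assumed to be coded as allele counts (0, 1, 2) with `None` for a missing call, the Hardy-Weinberg test is assumed to be a 1-degree-of-freedom chi-square goodness-of-fit test (an exact test may have been used instead), and the function names `hwe_chi2_pvalue` and `snp_passes_qc` are hypothetical.

```python
import math

def hwe_chi2_pvalue(called):
    """Chi-square (1 df) p-value for Hardy-Weinberg equilibrium.

    called: non-missing genotypes coded as allele counts 0, 1, 2.
    """
    n = len(called)
    p = sum(called) / (2 * n)              # frequency of the counted allele
    q = 1 - p
    observed = [called.count(0), called.count(1), called.count(2)]
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def snp_passes_qc(genotypes, min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-5):
    """Return True if a SNP passes the call rate, MAF, and HWE filters."""
    called = [g for g in genotypes if g is not None]
    if not genotypes or len(called) / len(genotypes) < min_call_rate:
        return False
    freq = sum(called) / (2 * len(called))
    if min(freq, 1 - freq) < min_maf:      # minor allele frequency
        return False
    return hwe_chi2_pvalue(called) >= min_hwe_p
```

The default thresholds mirror the values quoted in the text (0.95, 0.01, and 1 × 10⁻⁵). Whether the Hardy-Weinberg test was applied to the whole sample or to controls only is not stated, so the sketch leaves that choice to the caller.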
Gene-based test
To perform a gene-based test, the first problem is to define the genes and to assign SNPs to them. It is also important to make sure that the physical positions of the genes and SNPs refer to the same annotation release. We used the NCBI build 129 release 36.3 for both genes and SNPs. We considered a SNP to belong to a gene if it lay within the gene boundaries extended by 5,000 base pairs on either side. We analyzed 21,672 genes that contained 272,604 SNPs. This means that around half of the available SNPs were not assigned to any gene and were therefore left out of the gene-based analysis.
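The assignment rule can be illustrated with a short Python sketch. It assumes that genes are given as (name, chromosome, start, end) tuples and SNPs as (id, chromosome, position) tuples taken from the same annotation release; the function name `assign_snps_to_genes` and the input layout are hypothetical.

```python
def assign_snps_to_genes(genes, snps, flank=5000):
    """Map each gene name to the SNP ids located within the gene +/- flank bp.

    genes: list of (name, chromosome, start, end)
    snps:  list of (snp_id, chromosome, position)
    """
    assignment = {name: [] for name, _, _, _ in genes}
    for snp_id, snp_chrom, pos in snps:
        for name, gene_chrom, start, end in genes:
            # A SNP may fall in the extended window of more than one gene.
            if snp_chrom == gene_chrom and start - flank <= pos <= end + flank:
                assignment[name].append(snp_id)
    return assignment
```

With this rule, a SNP lying between two nearby genes can be assigned to both, and a SNP farther than 5,000 base pairs from every gene is assigned to none. The nested loop is quadratic and only meant to make the rule explicit; a genome-wide run would normally sort positions or use an interval index.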
The paradigm for the proposed gene-based test has three steps [11]:
1. Estimate the genetic similarity among individuals based on the genotypes of the SNPs in a given gene.
2. Cluster the individuals in groups by genetic similarity.
3. Test the association between the groups of individuals and the trait of interest.
In the first step, we used the Gower distance. Also known as Gower's coefficient [12], it is a measure of the similarity between two individuals based on the information given by a set of quantitative or qualitative variables. We realized that, in the special case of SNP genotypes, Gower distance is the same as the identity-by-state (IBS) multilocus measure. IBS allele sharing is a measure of genetic similarity between two individuals. Given the genotypes of two individuals at a given SNP, the IBS between them is 0, 1, or 2 depending on whether they share 0, 1, or 2 alleles at that SNP. This measure can be extended to several SNPs by adding the IBS for each locus and dividing by twice the number of loci [13]:
$$\mathrm{IBS}_{ij} = \frac{1}{2L} \sum_{l=1}^{L} \mathrm{IBS}^{l}_{ij}\left(g^{l}_{i}, g^{l}_{j}\right)$$
where $\mathrm{IBS}_{ij}$ is the multilocus IBS between individuals $i$ and $j$; $L$ is the number of loci considered in the calculation; $g^{l}_{i}$ and $g^{l}_{j}$ are the genotypes of individuals $i$ and $j$, respectively, at the $l$th locus ($l = 1, \ldots, L$); and $\mathrm{IBS}^{l}_{ij}$ is the IBS between $i$ and $j$ at locus $l$. We estimated this similarity measure for every pair of individuals at every gene. Thus, the result of the first step is a distance matrix among individuals in a given gene.
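The multilocus IBS measure described above can be sketched in Python as follows. The sketch assumes that genotypes are coded as counts of one allele (0, 1, 2) with no missing calls, so that the number of alleles shared at a locus is 2 minus the absolute difference of the two codes. Converting the similarity to a distance as one minus the similarity is also an assumption, since the text does not state the conversion. The function names are hypothetical.

```python
def ibs_similarity(gi, gj):
    """Multilocus IBS similarity between two individuals, in [0, 1].

    gi, gj: genotypes at the same L loci, coded as allele counts 0, 1, 2.
    """
    # Alleles shared at one locus: 2 (same genotype), 1, or 0 (opposite homozygotes).
    shared = sum(2 - abs(a - b) for a, b in zip(gi, gj))
    return shared / (2 * len(gi))

def ibs_distance_matrix(genotypes):
    """Pairwise distance matrix (1 - IBS similarity) for the SNPs of one gene.

    genotypes: one list of allele counts per individual.
    """
    n = len(genotypes)
    return [[1.0 - ibs_similarity(genotypes[i], genotypes[j]) for j in range(n)]
            for i in range(n)]
```

For example, two individuals with genotypes `[0, 1, 2, 0]` and `[2, 1, 1, 0]` share 0 + 2 + 1 + 2 = 5 alleles over 4 loci, giving a similarity of 5/8 and a distance of 3/8.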
In the second step, the distance matrix is used to find groups of individuals with similar genotypic distributions in the given gene. This clustering is performed hierarchically by means of a complete linkage agglomerative algorithm. Complete linkage evaluates the distance between two groups as the distance between their most distant pair of individuals. We divided the individuals into three groups of similarity. To test the effect of the chosen clustering algorithm on the results, we repeated the analysis on chromosomes 1, 6, and 9 using two other clustering algorithms: a hierarchical average linkage agglomerative algorithm and spectral clustering. Average linkage evaluates the distance between two groups as the mean distance between the individuals of the two groups.
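The agglomerative procedure can be sketched in a few lines of Python. This is a naive illustration of the technique, assuming a square symmetric distance matrix given as a list of lists; it is not the implementation used for the analysis and would be far too slow for the full sample. The function name `agglomerative_clusters` and the `linkage` parameter are hypothetical.

```python
def agglomerative_clusters(dist, n_clusters=3, linkage=max):
    """Agglomerative clustering on a square distance matrix.

    linkage summarizes the pairwise distances between two clusters:
    max gives complete linkage; a mean function gives average linkage.
    Returns a list of clusters, each a sorted list of row indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage([dist[i][j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Merge the two closest clusters and continue.
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters
```

Stopping the merging when three clusters remain corresponds to cutting the dendrogram at three groups. Passing a mean function as `linkage` switches the same loop from complete to average linkage.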
Spectral clustering is a method that defines k clusters on a set of n data points representing arbitrary objects. It is based on the spectral decomposition of the normalized graph Laplacian defined from a similarity matrix among the objects [14]. In our case, the objects are the individuals and the similarity matrix is the genetic similarity matrix defined above.
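For reference, one common formulation of the normalized graph Laplacian is written out below. Whether reference [14] uses exactly this normalization is an assumption here. The symbols $W$ (the $n \times n$ similarity matrix among individuals), $D$ (the diagonal degree matrix), and $I$ (the identity matrix) are introduced only for this illustration.

```latex
L_{\mathrm{sym}} = I - D^{-1/2} \, W \, D^{-1/2},
\qquad
D_{ii} = \sum_{j=1}^{n} W_{ij}
```

In this formulation, the eigenvectors of $L_{\mathrm{sym}}$ associated with the $k$ smallest eigenvalues are stacked as the columns of an $n \times k$ matrix, and the rows of that matrix (one per individual) are then grouped into $k$ clusters, typically with k-means.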
Finally, in the third step, association between the groups of individuals and the phenotype of interest was estimated using logistic regression with the group as a factor.
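The third step can be illustrated with a self-contained Python sketch. It assumes that the three-level group factor is dummy-coded against a reference group, that sex enters as a 0/1 covariate, and that the group effect is assessed with a likelihood-ratio test on 2 degrees of freedom. The text does not say which test statistic was used, so the likelihood-ratio test is an assumption, and all function names are hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def predict(X, beta):
    """Fitted probabilities under the logistic model."""
    return [1.0 / (1.0 + math.exp(-sum(bk * xk for bk, xk in zip(beta, row))))
            for row in X]

def logistic_loglik(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson; return the maximized log-likelihood."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        mu = predict(X, beta)
        grad = [sum((yi - mi) * row[a] for row, yi, mi in zip(X, y, mu))
                for a in range(p)]
        info = [[sum(mi * (1 - mi) * row[a] * row[b] for row, mi in zip(X, mu))
                 for b in range(p)] for a in range(p)]
        beta = [bk + sk for bk, sk in zip(beta, solve(info, grad))]
    mu = predict(X, beta)
    return sum(yi * math.log(mi) + (1 - yi) * math.log(1 - mi)
               for yi, mi in zip(y, mu))

def group_association_pvalue(groups, sex, status):
    """Likelihood-ratio p-value for a 3-level group factor, adjusted for sex.

    groups: cluster labels 0, 1, 2; sex: 0/1; status: 1 = case, 0 = control.
    """
    # Full model: intercept, two group dummies (reference group 0), sex.
    full = [[1.0, float(g == 1), float(g == 2), float(s)] for g, s in zip(groups, sex)]
    # Null model: intercept and sex only.
    null = [[1.0, float(s)] for s in sex]
    lr = 2 * (logistic_loglik(full, status) - logistic_loglik(null, status))
    # Survival function of a chi-square variable with 2 degrees of freedom.
    return math.exp(-max(lr, 0.0) / 2)
```

The sketch does not guard against empty groups or perfect separation of cases and controls, both of which make the Newton-Raphson step fail; a statistical package would normally handle those cases.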