Elsevier

Pattern Recognition

Volume 41, Issue 12, December 2008, Pages 3600-3612

Learning a Mahalanobis distance metric for data clustering and classification

https://doi.org/10.1016/j.patcog.2008.05.018

Abstract

The distance metric is a key issue in many machine learning algorithms. This paper considers the general problem of learning from pairwise constraints in the form of must-links and cannot-links. As one kind of side information, a must-link indicates that the two data points must belong to the same class, while a cannot-link indicates that they must belong to different classes. Given must-link and cannot-link information, our goal is to learn a Mahalanobis distance metric under which the distances of point pairs in must-links are as small as possible and those of point pairs in cannot-links are as large as possible. This task is formulated as a constrained optimization problem, whose global optimum can be obtained effectively and efficiently. Finally, applications to data clustering, interactive natural image segmentation and face pose estimation are presented. Experimental results illustrate the effectiveness of our algorithm.

Introduction

The distance metric is a key issue in many machine learning algorithms. For example, K-means and the K-nearest neighbor (KNN) classifier must be supplied with a suitable distance metric, through which neighboring data points can be identified. The commonly used Euclidean distance metric assumes that each feature of a data point is equally important and independent of the others. This assumption may not hold in real applications, especially when dealing with high-dimensional data where some features may not be tightly related to the topic of interest. In contrast, a good distance metric should identify important features and discriminate between relevant and irrelevant ones. Supplying such a distance metric is thus highly problem-specific and determines the success or failure of the learning algorithm or the developed system [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13].

There has been considerable research on distance metric learning over the past few years [14]. One family of algorithms is developed with known class labels of the training data points. Algorithms in this family include neighbourhood components analysis [15], large margin nearest neighbor classification [16], large margin component analysis [17], class collapse [18], and other extensions [19], [20]. Their success on a variety of problems shows that the learned distance metric yields substantial improvements over the commonly used Euclidean distance metric [15], [16], [17], [18]. However, class labels constitute strong supervision and cannot be easily obtained in some real-world situations. In contrast, it is more natural to specify which pairs of data points are similar or dissimilar, and such pairwise constraints arise frequently in many applications. For example, in image retrieval the user labels images as similar or dissimilar to the query, and these image pairs can be used to learn a distance metric [21]. Accordingly, another family of distance metric learning algorithms is developed to make use of such pairwise constraints [14], [21], [22], [23], [24], [25], [26], [27], [28], [29]. Pairwise constraints are a kind of side information [22]. One popular form of side information is must-links and cannot-links [22], [30], [31], [32], [33], [34], [35]. A must-link indicates that the two data points must belong to the same class, while a cannot-link indicates that they must belong to different classes. Another popular form is the relative comparison, e.g., “A is closer to B than A is to C” [26]. The utility of pairwise constraints has been demonstrated in many applications, showing that significant improvements can be achieved [21], [22], [23], [24], [25], [26], [27].

Both families of distance metric learning algorithms have been extended in many directions. Based on the class labels of training data points, Weinberger and Tesauro proposed to learn a distance metric for kernel regression [36]. Based on labeled training data, Hertz et al. maximized the margin with boosting to obtain distance functions for clustering [37]. Bilenko et al. integrated pairwise constraints (must-links and cannot-links) and metric learning into a semi-supervised clustering framework [38]; clustering on many data sets shows that the performance of the K-means algorithm is thereby substantially improved. Also based on must-links and cannot-links, Davis et al. developed an information-theoretic framework [39]. Compared with most existing methods, their framework avoids complex computations such as eigenvalue decomposition and semi-definite programming [15], [16]. Yang et al. presented a Bayesian framework in which a posterior distribution over the distance metric is estimated from labeled pairwise constraints [40]. Kumar et al. used relative comparisons to develop a new clustering algorithm in a semi-supervised setting [41]. Formulating the problem as a linear program, Rosales and Fung proposed to learn a sparse metric from relative comparison constraints; the sparsity of the learned metric helps to reduce the cost of distance computation [42]. In addition, distance metric learning algorithms have been extended with kernel tricks [11], [21], [43], [44], [45], and a nonlinear adaptive metric learning algorithm has also been developed [46]. Furthermore, online distance metric learning algorithms [39], [47] have recently been proposed for situations where the data points are collected sequentially. The use of learned distance metrics has been demonstrated in many real-world applications, including speech processing [48], visual representation [49], word categorization [12], face verification [50], medical image processing [51], video object classification [52], biological data processing [53], image retrieval [21], [54], and so on.

In this paper we focus on learning a Mahalanobis distance metric from must-links and cannot-links. The Mahalanobis distance is a measure between two data points in the space defined by relevant features. Since it accounts for unequal variances as well as correlations between features, it evaluates distances adequately by assigning different weights or importance factors to the features of data points. Only when the features are uncorrelated is the distance under a Mahalanobis metric identical to that under the Euclidean metric. Geometrically, a Mahalanobis distance metric can adjust the distribution of the data so that the distances between similar data points are small [22]. It can thus enhance the performance of clustering or classification algorithms, such as the KNN classifier. These advantages can be exploited for specific tasks on a given data set, provided a suitable Mahalanobis distance metric is available. It is natural to learn such a metric from prior knowledge supplied by the user according to his/her own task. One easy way to supply this prior knowledge is to give instances of similar/dissimilar data pairs (must-links/cannot-links). We hope that a Mahalanobis distance metric can be learned by forcing it to adjust the distances of the given instance pairs, and then applied to new data.

The basic idea in this paper is to minimize the distances of point pairs in must-links while maximizing those of point pairs in cannot-links. To this end, we formulate the task as a constrained optimization problem. Since the formulated problem cannot be solved analytically, an iterative framework is developed to find the optimum by way of binary search. A lower bound and an upper bound bracketing the optimum are explicitly estimated and then used to control the initial value, which benefits the initialization of the iterative algorithm. The globally optimal Mahalanobis distance matrix is thus obtained effectively and efficiently; moreover, the computation is fast, with exponential convergence. Comparative experiments on data clustering, interactive natural image segmentation and face pose estimation show the validity of our algorithm.
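To make the binary-search idea concrete, the sketch below bisects a trace-ratio objective of the kind named in the conclusion (a ratio of matrix traces). It is a minimal illustration under our own assumptions, not necessarily the paper's exact iteration: S_c and S_m stand for cannot-link and must-link scatter matrices, W is taken column-orthonormal, and S_m is assumed positive definite so the optimum is finite. The key fact used is that f(lam) = max_W tr(W^T (S_c - lam*S_m) W) is strictly decreasing in lam and vanishes exactly at the optimal ratio.

```python
import numpy as np

def f(lam, S_c, S_m, d):
    # max over column-orthonormal W of tr(W^T (S_c - lam*S_m) W)
    # = sum of the d largest eigenvalues of S_c - lam*S_m
    return np.sort(np.linalg.eigvalsh(S_c - lam * S_m))[-d:].sum()

def trace_ratio_bisect(S_c, S_m, d, tol=1e-8):
    """Binary search for lambda* = max_W tr(W^T S_c W) / tr(W^T S_m W).
    Assumes S_m is positive definite, so f is strictly decreasing
    and crosses zero exactly at lambda*."""
    lo, hi = 0.0, 1.0
    while f(hi, S_c, S_m, d) > 0:   # grow the upper bound until it brackets lambda*
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid, S_c, S_m, d) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# toy check on random matrices
rng = np.random.default_rng(0)
B, C = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
S_c, S_m = B @ B.T, C @ C.T + 1e-3 * np.eye(4)   # S_m made strictly positive definite
print(trace_ratio_bisect(S_c, S_m, d=2))
```

The doubling loop here is just a generic way to obtain an initial bracket; Section 3 gives the paper's explicit lower and upper bounds for the initial value.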

The remainder of this paper is organized as follows. Section 2 briefly introduces related work and our method. Section 3 formulates the problem and develops the algorithm. Experimental results and applications in data clustering, interactive image segmentation and face pose estimation are reported in Section 4. Section 5 concludes the paper.

Section snippets

Related work and our method

Given two data points $x_1 \in \mathbb{R}^n$ and $x_2 \in \mathbb{R}^n$, their Mahalanobis distance can be calculated as follows:
$$d_A(x_1, x_2) = \sqrt{(x_1 - x_2)^T A (x_1 - x_2)}, \tag{1}$$
where $A \in \mathbb{R}^{n \times n}$ is positive semi-definite. Using the eigenvalue decomposition, $A$ can be decomposed into $A = W W^T$. Thus, it is also feasible to learn the matrix $W$. Then we have
$$d_A(x_1, x_2) = \sqrt{(x_1 - x_2)^T (W W^T) (x_1 - x_2)}. \tag{2}$$
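As a concrete numerical check (ours, not from the paper), the following NumPy sketch computes Eq. (1) and verifies that factoring A = W W^T makes the Mahalanobis distance coincide with the Euclidean distance between the transformed points W^T x_1 and W^T x_2, as in Eq. (2):

```python
import numpy as np

def mahalanobis(x1, x2, A):
    """d_A(x1, x2) = sqrt((x1 - x2)^T A (x1 - x2)), Eq. (1)."""
    d = x1 - x2
    return np.sqrt(d @ A @ d)

def factorize(A):
    """Return W with A = W W^T, via the eigenvalue decomposition of A."""
    s, U = np.linalg.eigh(A)       # eigenvalues ascending; A symmetric PSD
    s = np.clip(s, 0.0, None)      # guard against tiny negative eigenvalues
    return U * np.sqrt(s)          # W = U diag(sqrt(s))

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
A = B @ B.T                        # a random positive semi-definite matrix
W = factorize(A)
x1, x2 = rng.standard_normal(3), rng.standard_normal(3)
# Eq. (2): the Mahalanobis distance is a Euclidean distance after y = W^T x
assert np.isclose(mahalanobis(x1, x2, A), np.linalg.norm(W.T @ x1 - W.T @ x2))
```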

Typically, Xing et al. studied the problem of learning a Mahalanobis matrix from must-links and cannot-links [22]. In their framework, the sum of the Mahalanobis distances of the point pairs in must-links is minimized, subject to the constraint that the distances of the point pairs in cannot-links remain large.

Problem formulation

Suppose we are given a data set $X = \{x_i\}_{i=1}^{N} \subset \mathbb{R}^n$ and two sets of pairwise constraints, namely must-links $S = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ are in the same class}\}$ and cannot-links $D = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ are in different classes}\}$. Our goal is to learn a Mahalanobis matrix $A$ such that the distances of point pairs in $S$ are as small as possible, while those in $D$ are as large as possible.

According to Eq. (2), we can equivalently choose to optimize the matrix $W \in \mathbb{R}^{n \times d}$, with $d \leq n$. To this end, we introduce the transformation $y = W^T x$.
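For intuition, here is a small NumPy sketch (ours; the function names and the ratio-of-traces surrogate are illustrative assumptions consistent with the objective named in the conclusion) showing how a candidate W scores against the must-link and cannot-link sets after the transformation y = W^T x. Larger scores mean cannot-link pairs are farther apart relative to must-link pairs:

```python
import numpy as np

def pair_scatter(X, pairs):
    """Sum of outer products (x_i - x_j)(x_i - x_j)^T over the index pairs."""
    n = X.shape[1]
    M = np.zeros((n, n))
    for i, j in pairs:
        d = X[i] - X[j]
        M += np.outer(d, d)
    return M

def separation_score(W, X, must, cannot):
    """Ratio of summed squared distances (cannot-links over must-links),
    measured after y = W^T x; tr(W^T M W) equals the sum of squared
    transformed distances over the pairs that built the scatter M."""
    S_m = pair_scatter(X, must)
    S_c = pair_scatter(X, cannot)
    return np.trace(W.T @ S_c @ W) / np.trace(W.T @ S_m @ W)

# toy data: five points in R^3, one must-link and one cannot-link
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
must, cannot = [(0, 1)], [(0, 4)]
W = rng.standard_normal((3, 2))   # a candidate transformation with d = 2
print(separation_score(W, X, must, cannot))
```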

Experiments

We evaluated our algorithm on several data sets and compared it with RCA, DCA and Xing's method. We report the experimental results and the applications to data clustering, interactive natural image segmentation and face pose estimation.

Conclusion

In summary, this paper addresses a general problem of learning a Mahalanobis distance metric from side information. It is formulated as a constrained optimization problem in which a ratio of distances (in terms of a ratio of matrix traces) is used as the objective function. An optimization algorithm is proposed to solve this problem, in which a lower bound and an upper bound bracketing the optimum are explicitly estimated and then used to stipulate the initial value for the iterations. Experimental results on data clustering, interactive natural image segmentation and face pose estimation demonstrate the effectiveness of the proposed algorithm.

Acknowledgments

This work is supported by projects of the National Natural Science Foundation of China (Grant Nos. 60721003 and 60675009). The anonymous reviewers have helped to improve the quality and presentation of this paper.

References (67)

  • R. Yan et al., Negative pseudo-relevance feedback in content-based video retrieval
  • X. He et al., Learning an image manifold for retrieval
  • J.R. He et al., Manifold ranking based image retrieval
  • A.S. Varde et al., Learning semantics-preserving distance metrics for clustering graphical data
  • G. Wu et al., Formulating context-dependent similarity functions
  • E.E. Korkmaz et al., Choosing a distance metric for automatic word categorization
  • F. Li et al., A transductive framework of distance metric learning by spectral dimensionality reduction
  • L. Yang, R. Jin, Distance metric learning: a comprehensive survey, Technical Report, Michigan State University...
  • J. Goldberger et al., Neighbourhood components analysis
  • K. Weinberger et al., Distance metric learning for large margin nearest neighbor classification
  • L. Torresani et al., Large margin component analysis
  • A. Globerson et al., Metric learning by collapsing classes
  • Z.H. Zhang et al., Parametric distance metric learning with label information
  • G. Lebanon, Flexible metric nearest neighbor classification, Technical Report, Statistics Department, Stanford...
  • C.H. Hoi, W. Liu, M.R. Lyu, W.Y. Ma, Learning distance metrics with contextual constraints for image retrieval, in:...
  • E.P. Xing et al., Distance metric learning, with application to clustering with side-information
  • A. Bar-Hillel et al., Learning distance functions using equivalence relations, J. Mach. Learn. Res. (2005)
  • I.W. Tsang et al., Distance metric learning with kernels
  • R. Rosales et al., Learning sparse metrics via linear programming
  • M. Schultz et al., Learning a distance metric from relative comparisons
  • L. Yang, R. Jin, R. Sukthankar, Y. Liu, An efficient algorithm for local distance metric learning, in: AAAI, Boston,...
  • D. Mochihashi et al., Learning nonstructural distance metric by minimum cluster distortions
  • W. Tang, S. Zhong, Pairwise constraints-guided dimensionality reduction, in: SDM Workshop on Feature Selection for Data...

About the Author—SHIMING XIANG received his B.S. degree from Chongqing Normal University, China, in 1993, his M.S. degree from Chongqing University, China, in 1996, and his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, China, in 2004. From 1996 to 2001, he was a lecturer at Huazhong University of Science and Technology, Wuhan, China. He was a postdoctoral researcher in the Department of Automation, Tsinghua University, until 2006. He is currently an Associate Professor in the Institute of Automation, Chinese Academy of Sciences. His interests include computer vision, pattern recognition, machine learning, etc.

About the Author—FEIPING NIE received his B.S. degree from the Department of Computer Science, North China University of Water Conservancy and Electric Power, China, in 2000, and his M.S. degree from the Department of Computer Science, Lanzhou University, China, in 2003. He is currently a Ph.D. candidate in the Department of Automation, Tsinghua University. His research interests focus on machine learning and its applications.

About the Author—CHANGSHUI ZHANG received his B.S. degree from the Department of Mathematics, Peking University, China, in 1986, and his Ph.D. degree from the Department of Automation, Tsinghua University, in 1992. He is currently a Professor in the Department of Automation, Tsinghua University. He is an Associate Editor of the journal Pattern Recognition. His interests include artificial intelligence, image processing, pattern recognition, machine learning, evolutionary computation, complex system analysis, etc.
