Technical Reports (CIS)

Document Type

Technical Report

Date of this Version

June 2008


University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-08-23.


Distance-based learning methods, like clustering and SVMs, are dependent on good distance metrics. This paper does unsupervised metric learning in the context of clustering. We seek transformations of data which give clean and well separated clusters where clean clusters are those for which membership can be accurately predicted. The transformation (hence distance metric) is obtained by minimizing the blur ratio, which is defined as the ratio of the within cluster variance divided by the total data variance in the transformed space. For minimization we propose an iterative procedure, Clustering Predictions of Cluster Membership (CPCM). CPCM alternately (a) predicts cluster memberships (e.g., using linear regression) and (b) clusters these predictions (e.g., using k-means). With linear regression and k-means, this algorithm is guaranteed to converge to a fixed point. The resulting clusters are invariant to linear transformations of original features, and tend to eliminate noise features by driving their weights to zero.



Date Posted: 16 June 2008