Departmental Papers (CIS)

Analyzing Knowledge Communities Using Foreground and Background Clusters

Lyle Ungar, University of Pennsylvania
Vasileios Kandylas, University of Pennsylvania
Samuel Phineas Upham, University of Pennsylvania

Document Type: Journal Article

V. Kandylas was supported in part by the Greek State Scholarship Foundation (IKY). Authors’ addresses: V. Kandylas, Department of Computer and Information Science, University of Pennsylvania; S. P. Upham, Wharton School, University of Pennsylvania; L. Ungar, Department of Computer and Information Science, University of Pennsylvania. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc.,


Insight into the growth (or shrinkage) of “knowledge communities” of authors that build on each other's work can be gained by studying the evolution over time of clusters of documents. We cluster documents based on the documents they cite in common using the Streemer clustering method, which finds cohesive foreground clusters (the knowledge communities) embedded in a diffuse background. We build predictive models with features based on the citation structure, the vocabulary of the papers, and the affiliations and prestige of the authors and use these models to study the drivers of community growth and the predictors of how widely a paper will be cited. We find that scientific knowledge communities tend to grow more rapidly if their publications build on diverse information and use narrow vocabulary and that papers that lie on the periphery of a community have the highest impact, while those not in any community have the lowest impact.
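
As a rough illustration of what it means to compare documents "based on the documents they cite in common", the sketch below scores pairs of papers by the overlap of their reference lists. It is not the Streemer method, which is not reproduced here. The Jaccard measure, the function names, and the toy paper and reference identifiers are all assumptions made for the example.

```python
from itertools import combinations


def coupling_similarity(refs_a, refs_b):
    """Jaccard overlap between two papers' reference sets (an assumed measure)."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    if not union:
        # Two papers with no references share nothing measurable.
        return 0.0
    return len(a & b) / len(union)


def pairwise_coupling(citations):
    """Map each unordered pair of paper ids to its coupling similarity."""
    return {
        (p, q): coupling_similarity(citations[p], citations[q])
        for p, q in combinations(sorted(citations), 2)
    }


# Hypothetical toy data: paper id -> set of cited document ids.
citations = {
    "paper_a": {"r1", "r2", "r3"},
    "paper_b": {"r2", "r3", "r4"},
    "paper_c": {"r9"},
}

scores = pairwise_coupling(citations)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

In this toy data, paper_a and paper_b share two of four distinct references, so they would sit close together, while paper_c shares none with either and would be a natural candidate for the diffuse background rather than a cohesive cluster.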


Date Posted: 25 July 2012