Statistical learning from relational databases

Alexandrin Popescul, University of Pennsylvania


One fundamental limitation of classical statistical modeling is the assumption that data is represented by a single table, even though most real-world problem domains have complex relational structure. Mapping data into a single table prior to training is often complicated and can be prohibitively expensive. A better approach is to use a statistical relational learning method that combines the strengths of statistical approaches with the higher expressivity of features automatically generated from complex data sources. Features can be generated lazily and selected using the sequential feature selection criteria of the corresponding statistical model.

This thesis presents a framework for learning discriminative statistical models from relational databases. The framework integrates incremental feature generation and selection into a single loop. We formulate feature generation as a search in the space of relational database queries expressed in SQL. Database queries produce candidate feature columns, which are considered sequentially for inclusion in the model according to statistical model selection criteria. The structure of the search space is based on the top-down search of refinement graphs widely used in inductive logic programming for learning logic descriptions from relational (first-order) representations. The use of statistics allows us to extend the original definition of refinement graphs beyond Boolean values to include a richer set of features constructed via aggregate operators.

We expand the feature space by augmenting the basic relational schema with cluster relations, which are automatically generated and factored into the search during learning. Using clusters improves scalability through dimensionality reduction. More importantly, entities derived from clusters increase the expressivity of the feature space by creating new first-class concepts that contribute to the creation of new features in more complex ways. For example, in CiteSeer, papers can be clustered based on words or citations, giving “topics”, and authors can be clustered based on the documents they co-author, giving “communities”. Such cluster-derived concepts become part of more complex feature expressions. Out of the large number of generated features, those that improve predictive accuracy are kept in the model, as decided by statistical feature selection criteria.

We also provide a dynamic feature generation method, which decides the order in which features are evaluated based on run-time feature selection feedback and can discover predictive features with less computation than generating all features in advance. We demonstrate the utility of our methodology in link prediction and document classification. We use CiteSeer, an online digital library of computer science papers that contains a rich set of relational data, including citation information, author names, conference and journal names, and the text of papers. In the link prediction application, we discover highly predictive features capturing complex regularities of the citation structure and document attributes, demonstrating the feasibility of the methodology in a more general social network setting. The document classification task shows that modeling features more complex than classical word counts improves document classification accuracy.
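As a rough illustration of the generate-and-select loop described in the abstract, the following Python sketch produces candidate features as SQL aggregate queries over a toy citation database and considers them one at a time for inclusion in a logistic regression model. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the schema (`cites`, `author`, `pair`), the toy rows, the three candidate queries and their fixed order, the use of logistic regression fitted by plain gradient ascent, and BIC as the selection criterion. The thesis searches a refinement graph of queries, which this sketch does not attempt to reproduce.

```python
import math
import sqlite3

# Toy relational database (hypothetical schema and rows, not CiteSeer's).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cites(src TEXT, dst TEXT);
    CREATE TABLE author(paper TEXT, name TEXT);
    -- Candidate links to classify; label = 1 if d1 should cite d2.
    CREATE TABLE pair(d1 TEXT, d2 TEXT, label INTEGER);
    INSERT INTO cites VALUES ('p1','p3'),('p1','p4'),('p2','p3'),
                             ('p2','p4'),('p5','p6'),('p6','p4');
    INSERT INTO author VALUES ('p1','ann'),('p2','ann'),('p2','bob'),
                              ('p5','cat'),('p6','dan');
    INSERT INTO pair VALUES ('p1','p2',1),('p2','p1',1),('p1','p5',0),
                            ('p5','p2',0),('p6','p1',0),('p2','p6',1);
""")


def candidate_queries():
    """Lazily yield (name, SQL subquery); each aggregates to one number per pair."""
    yield "common_cited", """SELECT COUNT(*) FROM cites a JOIN cites b
        ON a.dst = b.dst WHERE a.src = p.d1 AND b.src = p.d2"""
    yield "common_authors", """SELECT COUNT(*) FROM author a JOIN author b
        ON a.name = b.name WHERE a.paper = p.d1 AND b.paper = p.d2"""
    yield "d2_in_degree", "SELECT COUNT(*) FROM cites c WHERE c.dst = p.d2"


def feature_column(subquery):
    """Evaluate one candidate query for every row of `pair`, in row order."""
    sql = f"SELECT ({subquery}) FROM pair p ORDER BY p.rowid"
    return [row[0] for row in conn.execute(sql)]


def bic(feature_columns, y, steps=500, lr=0.1):
    """Fit logistic regression by gradient ascent and return its BIC."""
    n, k = len(y), len(feature_columns) + 1  # k counts the intercept
    rows = list(zip(*feature_columns)) if feature_columns else [()] * n
    w = [0.0] * k

    def prob(row):
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))
        return 1.0 / (1.0 + math.exp(-z))

    for _ in range(steps):
        grad = [0.0] * k
        for row, t in zip(rows, y):
            err = t - prob(row)
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]

    ll = sum(math.log(prob(r) if t else 1.0 - prob(r)) for r, t in zip(rows, y))
    return -2.0 * ll + k * math.log(n)


labels = [row[0] for row in conn.execute("SELECT label FROM pair ORDER BY rowid")]
columns = {}    # every feature column generated so far
selected = []   # names of features kept in the model
best = bic([], labels)  # intercept-only baseline

# Single loop: generate a feature, test it, keep it only if BIC improves.
for name, subquery in candidate_queries():
    columns[name] = feature_column(subquery)
    trial = bic([columns[s] for s in selected] + [columns[name]], labels)
    if trial < best:
        selected.append(name)
        best = trial

print("selected features:", selected)
```

Because the candidates come from a generator, a query is only executed when the loop reaches it, which mirrors the lazy generation idea. A dynamic variant in the spirit of the abstract would reorder or extend the remaining candidates depending on which features were just accepted.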

Subject Area

Computer science

Recommended Citation

Popescul, Alexandrin, "Statistical learning from relational databases" (2004). Dissertations available from ProQuest. AAI3125887.