Statistical Relational Learning for Document Mining

Popescul, Alexandrin; Ungar, Lyle H; Lawrence, Steve; Pennock, David M.

Penn collection

Departmental Papers (CIS)

Permalink

https://repository.upenn.edu/handle/20.500.14332/6256

Abstract

A major obstacle to fully integrated deployment of many data mining algorithms is the assumption that data sits in a single table, even though most real-world databases have complex relational structures. We propose an integrated approach to statistical modeling from relational databases. We structure the search space based on "refinement graphs", which are widely used in inductive logic programming for learning logic descriptions. The use of statistics allows us to extend the search space to include a richer set of features, including many that are not boolean. Search and model selection are integrated into a single process, allowing information criteria native to the statistical model, for example logistic regression, to make feature selection decisions in a step-wise manner. We present experimental results for the task of predicting where scientific papers will be published based on relational data taken from CiteSeer. Our approach results in classification accuracies superior to those achieved when using classical "flat" features. The resulting classifier can be used to recommend where to publish articles.
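
The step-wise selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Several things here are assumptions made only for illustration: the use of BIC as the information criterion (the abstract does not name one), the toy citation data, the feature names, and the helper functions fit_logistic, bic, and stepwise_select. The candidate features are listed by hand, whereas the paper generates them by searching a refinement graph. Two candidates are count aggregates over a hypothetical citation relation (non-boolean features) and one is a "flat" boolean feature.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, steps=2000, lr=0.1):
    """Fit logistic regression by batch gradient ascent; return (weights, log-likelihood)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(d):
                grad[j] += err * xi[j]
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    ll = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi))
        ll += yi * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return w, ll

def bic(log_lik, k, n):
    # Bayesian information criterion (assumed here; lower is better).
    return -2.0 * log_lik + k * math.log(n)

def stepwise_select(candidates, y):
    """Greedy forward selection: add the feature that lowers BIC most, stop when none does."""
    n = len(y)
    design = [[1.0] for _ in range(n)]  # intercept-only model
    selected = []
    best = bic(fit_logistic(design, y)[1], 1, n)
    while True:
        best_name, best_design = None, None
        for name, col in candidates.items():
            if name in selected:
                continue
            trial = [row + [v] for row, v in zip(design, col)]
            score = bic(fit_logistic(trial, y)[1], len(trial[0]), n)
            if score < best:
                best, best_name, best_design = score, name, trial
        if best_name is None:
            return selected, best
        selected.append(best_name)
        design = best_design

# Hypothetical toy data: for each of 12 papers, the venues of the papers it cites
# ("T" = the target venue, "O" = any other venue), as if joined from a citation table.
cited = ["TTTO", "TT", "TTTTOO", "TTO", "TTT", "TOO",
         "OO", "TO", "OOO", "", "TOO", "O"]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1 = paper appeared in the target venue

candidates = {
    # Relational aggregate features (counts over the citation relation).
    "count_cited_in_target_venue": [float(s.count("T")) for s in cited],
    "count_cited_in_other_venue": [float(s.count("O")) for s in cited],
    # A "flat" boolean feature of the paper itself.
    "title_has_mining": [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0],
}

selected, final_bic = stepwise_select(candidates, y)
print("selected features:", selected)
print("final BIC:", round(final_bic, 3))
```

In this sketch the criterion of the model itself decides both which feature enters and when the search stops, so no separate validation step is needed. A real system would replace the hand-written candidate dictionary with features produced lazily by database queries.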

Date of presentation

2003-11-19

Conference name

Third IEEE International Conference on Data Mining (ICDM 2003)

Comments

Copyright 2003 IEEE. Reprinted from Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003), pages 275-282. Publisher URL: http://ieeexplore.ieee.org/xpl/tocresult.jsp?isNumber=27998&page=2 This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of the University of Pennsylvania's products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.

Collection

Presentations