Learning to scale up search-driven data integration

Zhepeng Yan, University of Pennsylvania


A recent approach to the long-standing data integration problem is compositional and iterative, and is termed “pay-as-you-go” data integration. Under this model, the objective is to support queries over “partly integrated” data immediately, and to let the user community drive integration of the data that relate to their actual information needs. Over time, the data are gradually integrated. While the pay-as-you-go vision has been well articulated for some time, only recently have we begun to understand how it can be realized in a system implementation. One branch of this effort has focused on enabling queries through keyword search-driven data integration, in which users pose queries over partly integrated data encoded as a graph, receive ranked answers generated from data and metadata that are linked at query time, and provide feedback on those answers. From this user feedback, the system learns to repair bad schema matches or record links. Many real-world issues of uncertainty and diversity in search-driven integration remain open. Addressing them requires a combination of human guidance and machine learning, and the challenge is how to make maximal use of limited human input. This thesis develops three methods to scale up search-driven integration through learning from expert feedback: (1) active learning techniques to repair links from small amounts of user feedback; (2) collaborative learning techniques to combine users’ conflicting feedback; and (3) debugging techniques to identify where data experts could best improve integration quality. We implement these methods within the Q System, a prototype of search-driven integration, and validate their effectiveness over real-world datasets.

Subject Area

Computer science

Recommended Citation

Yan, Zhepeng, "Learning to scale up search-driven data integration" (2016). Dissertations available from ProQuest. AAI10239958.