IRCS Technical Reports Series
Information extraction consists of identifying classes of events and the relationships between extracted instances of these classes. Extracted data usually fills slots in a template and is stored in tables. We propose to extend this approach by using an object database. Information extraction tools are given a conceptual representation as schema components: concept classes, meta-concepts, and attributes. In a query, the user expresses a structure (the target structure) that corresponds to the user's understanding of the domain and serves as the schema for the database. We use the object data model, whose syntax matches both the user's target structure and the conceptual representation of extraction capabilities. Query evaluation proceeds in three steps. First, it determines the schema of the database as expressed by the user. Second, it populates the database through methods that invoke extraction tools on a given source of documents. Third, it returns the output of the query against the resulting database. The first two steps define an object view of the given source(s) as a materialized extension of the current schema (each refinement of a query may add more structure, and thus more extracted data), followed by a non-materialized projection.
Our approach is user-oriented: the object representation of data gives users the flexibility to phrase a query according to their own understanding of the domain, and object views are built on the fly according to the user's organization of the data. Because the conceptual representation of extraction capabilities is kept as a modular pool of schema components, new extraction tools can be plugged in easily.
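The three evaluation steps and the pool of schema components can be illustrated with a minimal Python sketch. It is not the report's implementation: the names `SchemaComponent`, `POOL`, `register`, `extract_people`, and `evaluate`, the dictionary-based "database", and the toy regular-expression extractor are all assumptions made for illustration, and the final projection is read here as the query's third step.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: an extraction tool maps raw text to a list of objects.
Extractor = Callable[[str], List[dict]]


@dataclass
class SchemaComponent:
    """Illustrative conceptual representation of one extraction tool."""
    concept: str            # concept class the tool can extract
    attributes: List[str]   # attributes it can fill
    extract: Extractor      # method invoking the tool on a document


# Pool of schema components; plugging in a new tool adds one entry.
POOL: Dict[str, SchemaComponent] = {}


def register(component: SchemaComponent) -> None:
    POOL[component.concept] = component


def extract_people(text: str) -> List[dict]:
    # Toy extractor (assumption): recognizes "<Name> works at <Org>".
    return [{"name": m.group(1), "employer": m.group(2)}
            for m in re.finditer(r"(\w+) works at (\w+)", text)]


register(SchemaComponent("Person", ["name", "employer"], extract_people))


def evaluate(target: Dict[str, List[str]],
             documents: List[str],
             database: Dict[str, List[dict]]) -> Dict[str, List[dict]]:
    for concept, attrs in target.items():
        # Step 1: determine the schema from the user's target structure.
        component = POOL[concept]
        unknown = set(attrs) - set(component.attributes)
        if unknown:
            raise ValueError(f"no extractor for attributes {sorted(unknown)}")
        # Step 2: populate (materialize) the database by invoking the tool,
        # unless an earlier query already extended the schema with this class.
        if concept not in database:
            database[concept] = [obj for doc in documents
                                 for obj in component.extract(doc)]
    # Step 3: answer the query as a projection that is not stored.
    return {concept: [{a: obj[a] for a in attrs} for obj in database[concept]]
            for concept, attrs in target.items()}


docs = ["Alice works at Acme. Bob works at Initech."]
db: Dict[str, List[dict]] = {}
print(evaluate({"Person": ["name"]}, docs, db))
```

In this sketch the database keeps every attribute the tool extracted, while the query result contains only the attributes named in the target structure, which mirrors the distinction between the materialized extension and the non-materialized projection.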
Information extraction, object data model, object view.
Date Posted: 20 August 2006
University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-98-11.