Extraction of Web Information Using W4F Wrapper Factory and XML-QL Query Language

Bhandari, Deepali

Files

MS_CIS_99_20.pdf (1.17 MB)

Permalink

https://repository.upenn.edu/handle/20.500.14332/7833

Abstract

In many ways, the Web has become the largest knowledge base known to us. The problem facing the user now is not that the information he seeks is unavailable, but that it is not easy for him to extract exactly what he needs from what is available. It is also becoming clear that a top-down approach of gathering all the information and structuring it will not work, except in some special cases. Indeed, most of the information is present in HTML documents structured only for visual content. Instead, new tools are being developed that attack this problem from a different angle. XML is a language that allows the publisher of the data to structure it using markup tags. These markup tags clarify not only the visual structure of the document, but also the semantic structure. Additionally, one can use the query language XML-QL to query XML pages for information and to merge information from disparate XML sources. However, most of the content of the Web is published in HTML. The W4F system allows us to construct wrappers that retrieve web pages, extract the desired information using the HTML structure and regular expression search, and map it automatically to XML with its XML-Gateway feature. In this thesis, we investigate the W4F/XML-QL paradigm for querying the Web. Two examples are presented. The first is the Internet Movie Database, which we query with the aim of understanding the power of these systems. The second is the NCBI BLAST server, which is a suite of programs for biomolecular sequence analysis. We demonstrate that there are real-life instances where this paradigm promises to be extremely useful.

Date of degree

1999-08-01

Comments

University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-99-20.

Collection

Dissertations and Theses