Distributed Web Crawling over DHTs

Penn collection
Departmental Papers (CIS)
Author
Cooper, Owen
Krishnamurthy, Sailesh
Abstract

In this paper, we present the design and implementation of a distributed web crawler. We begin by motivating the need for such a crawler as a basic building block for decentralized web search applications. The distributed crawler harnesses the excess bandwidth and computing resources of clients to crawl the web. Nodes participating in the crawl use a Distributed Hash Table (DHT) to coordinate and distribute work. We study different crawl distribution strategies and investigate the trade-offs among communication overhead, crawl throughput, load balancing across both the crawlers and the crawl targets, and the ability to exploit network proximity. We present an implementation of the distributed crawler using PIER, a relational query processor that runs over the Bamboo DHT, and compare different crawl strategies on PlanetLab, querying live web sources.
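
As a rough illustration of how a DHT-style key space can distribute crawl work, the sketch below assigns each URL to the node whose hashed identifier follows the URL's hashed key on a ring. It is a minimal sketch under stated assumptions and does not use the PIER or Bamboo APIs: the `Ring` class, the node names, and the two partitioning keys (full URL versus hostname) are hypothetical choices made for illustration, since the abstract does not name the specific strategies compared.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def _hash(key: str) -> int:
    """Map a string onto the ring's identifier space using SHA-1."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class Ring:
    """Hypothetical stand-in for a DHT: a key is owned by its successor node."""

    def __init__(self, node_ids):
        if not node_ids:
            raise ValueError("at least one node is required")
        # Sort nodes by their position on the ring.
        self._points = sorted((_hash(n), n) for n in node_ids)
        self._keys = [h for h, _ in self._points]

    def owner(self, key: str) -> str:
        # First node at or after the key's position, wrapping around the ring.
        i = bisect.bisect_left(self._keys, _hash(key)) % len(self._points)
        return self._points[i][1]


def assign_by_url(ring: Ring, url: str) -> str:
    """Partition by full URL: pages of one site may spread over many crawlers."""
    return ring.owner(url)


def assign_by_host(ring: Ring, url: str) -> str:
    """Partition by hostname: every page of a site goes to the same crawler."""
    host = urlsplit(url).hostname or ""
    return ring.owner(host)


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c"])
    for u in ("http://example.org/a", "http://example.org/b"):
        print(u, assign_by_url(ring, u), assign_by_host(ring, u))
```

Under these assumptions, hashing on the hostname keeps all requests to one site on a single crawler, which makes it easier to rate-limit load on that crawl target, while hashing on the full URL tends to spread work more evenly across crawlers. This is one way to picture the load-balancing trade-off the abstract mentions, not a description of the authors' measured results.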

Publication date
2004-02-01
Comments
University of California, Berkeley, Department of Electrical Engineering and Computer Sciences, Technical Report No. CSD-04-1305.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. (Department of Electrical Engineering and Computer Sciences, University of California, Berkeley)
NOTE: At the time of publication, author Boon Thau Loo was affiliated with the University of California at Berkeley. Currently (March 2007), he is a faculty member in the Department of Computer and Information Science at the University of Pennsylvania.