Distributed Web Crawling over DHTs

Loo, Boon Thau; Cooper, Owen; Krishnamurthy, Sailesh

Files

webcrawl.pdf (257.98 KB)

Penn collection

Departmental Papers (CIS)

Permalink

https://repository.upenn.edu/handle/20.500.14332/6365

Abstract

In this paper, we present the design and implementation of a distributed web crawler. We begin by motivating the need for such a crawler as a basic building block for decentralized web search applications. The distributed crawler harnesses the excess bandwidth and computing resources of clients to crawl the web. Nodes participating in the crawl use a Distributed Hash Table (DHT) to coordinate and distribute work. We study different crawl distribution strategies and investigate the trade-offs among communication overhead, crawl throughput, load balance across both the crawlers and the crawl targets, and the ability to exploit network proximity. We present an implementation of the distributed crawler using PIER, a relational query processor that runs over the Bamboo DHT, and compare different crawl strategies on PlanetLab, querying live web sources.
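As an illustration of what distributing crawl work over a DHT can look like, the sketch below assigns URLs to crawler nodes by hashing a key onto an identifier ring. It is not the report's implementation and does not use PIER or Bamboo. The `Ring` class, the `assign` function, the node names, and the two partition keys (full URL versus hostname) are all assumptions made for this example; the abstract does not name the specific strategies it compares.

```python
import hashlib
from bisect import bisect_right
from urllib.parse import urlsplit


def ring_position(value: str) -> int:
    """Map a string to a point on a 160-bit identifier ring using SHA-1."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


class Ring:
    """Toy stand-in for a DHT (hypothetical): a key is owned by the first
    node whose ring position follows the key's position, wrapping around."""

    def __init__(self, node_names):
        if not node_names:
            raise ValueError("at least one node is required")
        self._nodes = sorted((ring_position(name), name) for name in node_names)
        self._positions = [pos for pos, _ in self._nodes]

    def owner(self, key: str) -> str:
        # Index of the first node position strictly greater than the key's
        # position; the modulo wraps past the last node back to the first.
        idx = bisect_right(self._positions, ring_position(key)) % len(self._nodes)
        return self._nodes[idx][1]


def assign(ring: Ring, url: str, strategy: str) -> str:
    """Pick the crawler responsible for a URL.

    strategy "url":  hash the whole URL, so pages of one site are scattered.
    strategy "host": hash only the hostname, so one crawler owns a site.
    Both strategies are illustrative assumptions, not taken from the report.
    """
    if strategy == "url":
        key = url
    elif strategy == "host":
        key = urlsplit(url).hostname or ""
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return ring.owner(key)


if __name__ == "__main__":
    ring = Ring(["crawler-a", "crawler-b", "crawler-c"])
    for url in ("http://example.com/a", "http://example.com/b", "http://example.org/"):
        print(url, assign(ring, url, "url"), assign(ring, url, "host"))
```

Under these assumptions, the two keys hint at the kind of trade-off the abstract mentions. Hashing the hostname keeps every page of a site on one crawler, which makes it easier to limit the request rate seen by that crawl target. Hashing the full URL tends to spread work more evenly across crawlers, but several crawlers may then contact the same site.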

Publication date

2004-02-01

Comments

University of California, Berkeley, Department of Electrical Engineering and Computer Sciences, Technical Report No. CSD-04-1305.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. (Department of Electrical Engineering and Computer Sciences, University of California, Berkeley)

NOTE: At the time of publication, author Boon Thau Loo was affiliated with the University of California at Berkeley. Currently (March 2007), he is a faculty member in the Department of Computer and Information Science at the University of Pennsylvania.

Collection

Reports