Distributed Web Crawling over DHTs
Abstract
In this paper, we present the design and implementation of a distributed web crawler. We begin by motivating the need for such a crawler as a basic building block for decentralized web search applications. The distributed crawler harnesses the excess bandwidth and computing resources of clients to crawl the web. Nodes participating in the crawl use a Distributed Hash Table (DHT) to coordinate and distribute work. We study different crawl distribution strategies and investigate the trade-offs in communication overhead, crawl throughput, load balance on both the crawlers and the crawl targets, and the ability to exploit network proximity. We present an implementation of the distributed crawler using PIER, a relational query processor that runs over the Bamboo DHT, and compare different crawl strategies on PlanetLab, querying live web sources.
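The core coordination idea in the abstract (nodes using a DHT to partition and distribute crawl work) can be sketched with consistent hashing, the key-to-node mapping technique underlying DHTs such as Bamboo. This is a minimal illustrative sketch, not the paper's implementation; the `CrawlPartitioner` class, the per-hostname partitioning rule, and the node names are assumptions made for the example.

```python
import hashlib
from bisect import bisect_right
from urllib.parse import urlparse

class CrawlPartitioner:
    """Assigns URLs to crawler nodes via consistent hashing on a ring,
    mimicking how a DHT maps keys to node identifiers.
    (Illustrative sketch; not the paper's actual implementation.)"""

    def __init__(self, nodes):
        # Place each node on the ring at the hash of its name.
        self.ring = sorted((self._hash(n), n) for n in nodes)
        self._keys = [k for k, _ in self.ring]

    @staticmethod
    def _hash(key):
        # 64-bit ring position derived from SHA-1.
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def node_for(self, url):
        # Partition by hostname so a single node owns each site; this
        # also concentrates politeness control for each crawl target.
        host = urlparse(url).hostname or url
        # The owning node is the hash's successor on the ring.
        i = bisect_right(self._keys, self._hash(host)) % len(self.ring)
        return self.ring[i][1]

# Hypothetical three-node deployment.
p = CrawlPartitioner(["node-a", "node-b", "node-c"])
# All URLs from the same host map to the same crawler node.
assert p.node_for("http://example.com/a") == p.node_for("http://example.com/b")
```

Partitioning by hostname is one of several distribution strategies the abstract alludes to; alternatives (e.g., partitioning by full URL) trade per-site locality for finer-grained load balance.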