Efficient Web Harvesting Strategies for Monitoring Deep Web Content

by Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen

Focused Web Harvesting aims at achieving a complete harvest of a set of related web data for a given topic. Whether you are a fan following your favourite artist, athlete or politician, or a journalist investigating a topic, you need to access all the information relevant to your topics of interest and keep it up-to-date over time. General search engines like Google apply different techniques to enhance the freshness of their crawled data. However, in Focused Web Harvesting, we lack an efficient approach that detects changes in the content for a given topic over time. In this paper, we focus on techniques that allow us to keep the content relevant to a given entity up-to-date. To do so, we introduce approaches to efficiently harvest all the new and changed documents matching a given entity by querying a web search engine. One of our proposed approaches outperforms the baseline and the other approaches in finding the changed content on the web for a given entity, performing at least 20 percent better on average.
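To make the notion of "new and changed documents" concrete, the following is a minimal sketch of how two harvests for the same entity could be compared. It is not the method proposed in the paper: the function names `fingerprint` and `diff_harvests`, the use of SHA-256 content hashes, and the URL-keyed dictionaries are all assumptions made for illustration. How the pages are retrieved from the search engine is left out.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the page text (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_harvests(previous: dict[str, str], current: dict[str, str]):
    """Compare a stored harvest with a fresh one (illustrative only).

    previous: URL -> fingerprint stored from the earlier harvest
    current:  URL -> page text retrieved in the new harvest
    Returns two sorted lists: URLs that are new, and URLs whose content changed.
    """
    new, changed = [], []
    for url, text in current.items():
        if url not in previous:
            new.append(url)          # not seen in the earlier harvest
        elif previous[url] != fingerprint(text):
            changed.append(url)      # seen before, but the content differs
    return sorted(new), sorted(changed)


# Example with made-up URLs and page texts.
old = {
    "http://example.org/a": fingerprint("first version"),
    "http://example.org/b": fingerprint("unchanged page"),
}
fresh = {
    "http://example.org/a": "second version",
    "http://example.org/b": "unchanged page",
    "http://example.org/c": "brand new page",
}
new_urls, changed_urls = diff_harvests(old, fresh)
```

An exact hash flags any byte-level difference as a change, so a real system would probably need a more tolerant comparison for pages with dynamic content such as timestamps or advertisements.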

[download pdf]

The software for this work is available as: HaverstED.