Fully automated web harvesting using a combination of new and existing heuristics
by Niels Visser
Several techniques exist for extracting useful content from web pages. However, the definition of 'useful' is very broad and context-dependent. In this research, several techniques, both existing and new, are evaluated and combined in order to extract object data in a fully automatic way. The data sources used are mostly web shops, sites that promote housing, and vacancy sites. The data to be extracted from these pages are items, houses, and vacancies, respectively. Three kinds of approaches are combined and evaluated: clustering algorithms, algorithms that compare pages, and algorithms that look at the structure of single pages. Clustering is done in order to differentiate between pages that contain data and pages that do not. The algorithms that extract the actual data are then executed on the cluster that is expected to contain the most useful data. The quality measures used to assess the performance of the applied techniques are precision and recall per page. The results show that without proper clustering, the algorithms that extract the actual data perform very poorly. Whether or not clustering performs acceptably depends heavily on the web site. For some sites (for example, nationalevacaturebank.nl and funda.nl), URL-based clustering stands out, with precisions of around 33% and recalls of around 85%. URL-based clustering is therefore the most promising clustering method reviewed in this research. Of the extraction methods, the existing methods perform better than the alterations proposed in this research. Algorithms that look at the intra-page document structure perform best of the four methods compared, with an average recall between 30% and 50% and an average precision ranging from very low (around 2%) to quite low (around 33%). Template induction, an algorithm that compares pages with each other, also performs relatively well, but it is more dependent on the quality of the clusters.
The conclusion of this research is that it is not yet possible to extract data from websites fully automatically using a combination of the methods that are discussed and proposed.
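The quality measures named above, precision and recall per page, can be illustrated with a minimal sketch. It assumes that both the extracted data and the expected data of a page can be represented as sets of text fragments; that representation and the function names precision_recall and average_per_page are hypothetical and are not taken from this research.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for a single page.

    Assumption: both arguments are collections of text fragments;
    a fragment counts as correct only if it appears in both.
    """
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    # Precision: share of the extracted fragments that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of the expected fragments that were extracted.
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


def average_per_page(pages):
    """Average the per-page scores over (extracted, expected) pairs."""
    scores = [precision_recall(ext, exp) for ext, exp in pages]
    if not scores:
        return 0.0, 0.0
    n = len(scores)
    avg_precision = sum(p for p, _ in scores) / n
    avg_recall = sum(r for _, r in scores) / n
    return avg_precision, avg_recall
```

Under these assumptions, a page on which four fragments are extracted and two of them are the only expected ones scores a precision of 0.5 and a recall of 1.0: everything useful was found, but half of the output is noise. This is the same pattern as the high-recall, low-precision figures reported above.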