Efficient Web Harvesting Strategies for Monitoring Deep Web Content

by Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen

Focused Web Harvesting aims at achieving a complete harvest of a set of related web data for a given topic. Whether you are a fan following your favourite artist, athlete or politician, or a journalist investigating a topic, you need access to all the information relevant to your topics of interest, and you need to keep it up-to-date over time. General search engines like Google apply different techniques to enhance the freshness of their crawled data. In Focused Web Harvesting, however, we lack an efficient approach for detecting changes in the content for a given topic over time. In this paper, we focus on techniques that allow us to keep the content relevant to a given entity up-to-date. To do so, we introduce approaches that efficiently harvest all the new and changed documents matching a given entity by querying a web search engine. One of our proposed approaches outperforms the baseline and the other approaches in finding the changed content on the web for a given entity, performing on average at least 20 percent better.

[download pdf]

The software for this work is available as: HarvestED.
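To illustrate the change-detection step, here is a minimal Python sketch (not the HarvestED implementation): documents returned for an entity query are fingerprinted with a hash, and the fingerprints are compared against those stored during the previous harvest to separate new documents from changed ones. The (url, content) input format is an assumption made for the example.

```python
import hashlib
from typing import Dict, Iterable, Set, Tuple

def fingerprint(text: str) -> str:
    """Hash the page content so that changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_changes(
    results: Iterable[Tuple[str, str]],   # (url, content) pairs returned for the entity query
    previous: Dict[str, str],             # url -> fingerprint stored by the last harvest
) -> Tuple[Set[str], Set[str], Dict[str, str]]:
    """Split the current harvest into new and changed documents and return fresh fingerprints."""
    new_urls: Set[str] = set()
    changed_urls: Set[str] = set()
    current: Dict[str, str] = {}
    for url, content in results:
        fp = fingerprint(content)
        current[url] = fp
        if url not in previous:
            new_urls.add(url)
        elif previous[url] != fp:
            changed_urls.add(url)
    return new_urls, changed_urls, current
```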

Niels Visser graduates on automated web harvesting

Fully automated web harvesting using a combination of new and existing heuristics

by Niels Visser

Several techniques exist for extracting useful content from web pages. However, the definition of 'useful' is very broad and context dependent. In this research, several techniques – existing ones and new ones – are evaluated and combined in order to extract object data in a fully automatic way. The data sources used for this are mostly web shops, housing sites, and job vacancy sites; the data to be extracted from these pages are, respectively, items, houses, and vacancies. Three kinds of approaches are combined and evaluated: clustering algorithms, algorithms that compare pages, and algorithms that look at the structure of single pages. Clustering is done in order to differentiate between pages that contain data and pages that do not. The algorithms that extract the actual data are then executed on the cluster that is expected to contain the most useful data. The quality measures used to assess the performance of the applied techniques are precision and recall per page. Without proper clustering, the algorithms that extract the actual data perform very poorly. Whether clustering performs acceptably depends heavily on the web site. For some sites, URL-based clustering stands out (for example, nationalevacaturebank.nl and funda.nl), with precisions of around 33% and recalls of around 85%; URL-based clustering is therefore the most promising clustering method reviewed in this research. Of the extraction methods, the existing methods perform better than the alterations proposed in this research. Algorithms that look at the structure of a single page (intra-page document structure) perform best of the four compared methods, with an average recall between 30% and 50% and an average precision ranging from very low (around 2%) to quite low (around 33%). Template induction, an algorithm that compares pages, also performs relatively well, but it is more dependent on the quality of the clusters. The conclusion of this research is that it is not yet possible to fully automatically extract data from websites using a combination of the discussed and proposed methods.
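To illustrate the URL-based clustering that came out as the most promising clustering method, here is a minimal Python sketch under an assumed normalisation rule (it is not the exact heuristic from the thesis): URLs are reduced to a template-like signature so that detail pages, such as vacancy pages, group together, separate from navigation and overview pages. The example URLs are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

def url_signature(url: str) -> str:
    """Reduce a URL to a template-like signature, e.g. /vacature/<num>."""
    path = urlparse(url).path
    segments = [re.sub(r"\d+", "<num>", segment) for segment in path.split("/") if segment]
    return "/" + "/".join(segments)

def cluster_by_url(urls):
    """Group pages whose URLs share the same signature."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_signature(url)].append(url)
    return clusters

# Hypothetical example: the two vacancy detail pages share the signature /vacature/<num>.
pages = [
    "https://www.nationalevacaturebank.nl/vacature/12345",
    "https://www.nationalevacaturebank.nl/vacature/67890",
    "https://www.nationalevacaturebank.nl/over-ons",
]
print(cluster_by_url(pages))
```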

Towards Complete Coverage in Focused Web Harvesting

by Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen

With the goal of harvesting all information about a given entity, in this paper we try to harvest all documents matching a given query submitted to a search engine. The objective is to retrieve all information about, for instance, “Michael Jackson”, “Islamic State”, or “FC Barcelona” from data indexed by search engines, or hidden behind web forms, using a minimum number of queries. Policies of web search engines usually do not allow accessing all of the search results matching a given query: they limit the number of returned documents and the number of user requests. These limitations also apply to deep web sources, for instance social networks like Twitter. In this work, we propose a new approach which automatically collects information related to a given query from a search engine, given the search engine’s limitations. The approach minimizes the number of queries that need to be sent by analysing the retrieved results and combining this analysed information with information from a large external corpus. The new approach outperforms existing approaches when tested on Google, measured by the total number of unique documents found per query.

To be presented at the 17th International Conference on Information Integration and Web-based Applications & Services on 11-13 December 2015 in Brussels, Belgium

[download pdf]
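As a rough illustration of the query-selection loop, the sketch below picks each follow-up query term from the documents harvested so far, scored against term statistics from an external corpus. The scoring rule and the search(query) client are illustrative assumptions, not the exact method or API evaluated in the paper.

```python
from collections import Counter
from typing import Callable, Dict, List, Set

def next_query(
    harvested_texts: List[str],
    external_df: Dict[str, float],   # term -> document frequency in a large external corpus
    used_terms: Set[str],
) -> str:
    """Pick the candidate term expected to match many still-unseen documents."""
    seen_tf = Counter(term for text in harvested_texts for term in text.lower().split())
    candidates = {
        term: external_df.get(term, 0.0) / (1 + seen_tf[term])  # frequent outside, not yet exhausted inside
        for term in seen_tf
        if term not in used_terms
    }
    return max(candidates, key=candidates.get)

def harvest(entity: str, search: Callable[[str], List[str]],
            external_df: Dict[str, float], budget: int) -> Set[str]:
    """Issue at most `budget` queries of the form '<entity> <term>' and collect the unique documents."""
    used = {entity.lower()}
    documents: Set[str] = set(search(entity))        # first query: the entity itself
    for _ in range(budget - 1):
        term = next_query(list(documents), external_df, used)
        used.add(term)
        documents.update(search(f"{entity} {term}"))
    return documents
```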

Harvesting all matching information to a given query from a deep website

by Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen

In this paper, the goal is to harvest all documents matching a given (entity) query from a deep web source. The objective is to retrieve all information about, for instance, “Denzel Washington”, “Iran Nuclear Deal”, or “FC Barcelona” from data hidden behind web forms. Policies of web search engines usually do not allow accessing all of the search results matching a given query: they limit the number of returned documents and the number of user requests. In this work, we propose a new approach which automatically collects information related to a given query from a search engine, given the search engine's limitations. The approach minimizes the number of queries that need to be sent by applying information from a large external corpus. The new approach outperforms existing approaches when tested on Google, measured by the total number of unique documents found per query.

To be presented at the 1st International Workshop on Knowledge Discovery on the Web (KDWeb 2015) on 3-5 September in Cagliari, Italy.

[download pdf]
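A minimal sketch of this harvesting setting under the engine's limitations, assuming a hypothetical search(query, max_results) client and a fixed request budget: follow-up queries built from external-corpus terms are issued until the budget runs out, collecting the unique documents that the capped result lists return.

```python
from typing import Callable, List, Set

def harvest_capped(
    entity: str,
    search: Callable[[str, int], List[str]],  # returns at most `max_results` document ids per query
    corpus_terms: List[str],                  # external-corpus terms, most frequent first
    max_results: int = 100,
    max_requests: int = 50,
) -> Set[str]:
    """Collect unique documents for `entity` without exceeding the engine's request budget."""
    seen: Set[str] = set()
    queries = [entity] + [f"{entity} {term}" for term in corpus_terms]
    for query in queries[:max_requests]:      # never send more requests than the engine allows
        seen.update(search(query, max_results))
    return seen
```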

Designing a Deep Web Harvester

by Mohammadreza Khelghati, Maurice van Keulen, and Djoerd Hiemstra

To make deep web data accessible, harvesters have a crucial role. Targeting different domains and websites requires a general-purpose harvester which can be applied to different settings and situations. To develop such a harvester, a large number of issues should be addressed. To capture all influential elements in one big picture, a new concept, called the harvestability factor (HF), is introduced in this paper. The HF is defined as an attribute of a website (HFW) or a harvester (HFH) representing the extent to which the website can be harvested or the harvester can harvest. The constituent elements of these factors are features of websites or harvesters, gathered from the literature or introduced through the authors' experiments. In addition to enabling designers to evaluate where their products stand from the harvesting perspective, the HF can act as a framework for designing harvesters: designers can define the list of features and prioritize their implementation. To validate the effectiveness of the HF in practice, it is shown how the HF's elements can be applied in categorizing deep websites and how this is useful in designing a harvester. To validate the HFH as an evaluation metric, it is shown how it can be calculated for the harvester implemented by the authors. The results show that the developed harvester works well for the targeted test set, with a score of 14.8 out of 15.

To be presented at the Workshop on Surfacing the Deep and the Social Web (SDSW 2014), co-located with the 13th International Semantic Web Conference, in Riva del Garda, Trentino, Italy
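As an illustration of how a harvestability factor can be calculated as an evaluation metric, the sketch below computes a weighted sum of per-feature scores. The feature names and weights are hypothetical examples, not the elements or formula from the paper.

```python
from typing import Dict

def harvestability_factor(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-feature scores in [0, 1]; the maximum equals the sum of the weights."""
    return sum(weight * scores.get(feature, 0.0) for feature, weight in weights.items())

# Hypothetical harvester-side (HFH) checklist: how well a harvester handles each website feature.
weights = {"javascript_forms": 3.0, "pagination": 3.0, "session_handling": 3.0,
           "rate_limiting": 3.0, "result_parsing": 3.0}        # maximum score: 15
scores = {"javascript_forms": 1.0, "pagination": 1.0, "session_handling": 0.9,
          "rate_limiting": 1.0, "result_parsing": 1.0}
print(harvestability_factor(scores, weights))                  # prints the weighted score (max 15)
```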

Sabbatical at Q-Able

Starting today, I am on sabbatical at Q-Able, an exciting new internet startup and spin-off of the University of Twente. Q-Able will bring new search capabilities to internet web shops, hotel and travel booking sites, online banking, etc., by replacing multi-field web forms with free-text querying. Instead of meticulously filling in one field of a web form at a time, users of your web site get a simple, single search field. Q-Able's solutions provide a better user experience for the visitors of web sites, and they give the company running the web site the opportunity to find out what their customers really want (you'd be surprised at the things people will enter in a single search field).

More information shortly at: q-able.com.

Deep Web Entity Monitoring

by Mohammad Khelghati

Search engines do not cover all the data available on the Web. In addition to the fact that none of these search engines cover all the web pages existing on the Web, they miss the data behind web search forms. This data, which is not accessible through search engines, is referred to as the hidden web or deep web. It is estimated that the deep web contains data at a scale several times bigger than the data accessible through search engines, which is referred to as the surface web. Although this information on the deep web can be accessed through the sources' own interfaces, finding and querying all the interesting sources of information that might be useful can be a difficult, time-consuming and tiring task. Considering the huge amount of information that might be related to one's information needs, it might even be impossible for a person to cover all the deep web sources of interest. Therefore, there is a great demand for applications which facilitate access to this large amount of data locked behind web search forms. Realizing approaches to meet this demand is one of the main issues targeted in this PhD project. Having provided access to deep web data, different techniques can be applied to provide users with additional value from this data, such as analyzing the data and finding patterns and relationships among different data items and data sources. This research, however, targets monitoring entities in deep web sources.

To be presented at the World Wide Web Conference Doctoral Consortium on 13 May in Rio de Janeiro, Brazil.

Size estimation of non-cooperative data collections

by Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen

In this paper, approaches for estimating the size of non-cooperative databases and search engines are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on modifications of the available techniques are introduced and evaluated. One of these modifications improves the estimations of the other approaches by 35 to 65 percent.

To be presented at the 14th International Conference on Information Integration and Web-based Applications and Services (iiWAS 2012) on 3-5 December 2012 in Bali, Indonesia

[download pdf]
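For illustration, the sketch below shows one classical estimator in this family, capture-recapture (Lincoln-Petersen), which estimates the size of a non-cooperative collection from the overlap between two independently sampled result sets. It is shown for context only and is not one of the four modified methods introduced in the paper.

```python
def capture_recapture(sample_a: set, sample_b: set) -> float:
    """Estimate collection size as |A| * |B| / |A ∩ B| (undefined if the samples do not overlap)."""
    overlap = len(sample_a & sample_b)
    if overlap == 0:
        raise ValueError("samples do not overlap; draw larger samples")
    return len(sample_a) * len(sample_b) / overlap

# Toy example with document identifiers returned by two batches of random queries.
a = {"d1", "d2", "d3", "d4", "d5"}
b = {"d4", "d5", "d6", "d7"}
print(capture_recapture(a, b))   # 5 * 4 / 2 = 10 documents estimated
```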