by Mohammadreza Khelghati
Data is one of the keys to success. Whether you are a fraud detection officer in a tax office, a data journalist or a business analyst, your primary concern is to access all the relevant data to your topics of interest. In such an information-thirsty environment, accessing every source of information is valuable. This emphasizes the role of the web as one of the biggest and main sources of data. In accessing web data through either general search engines or direct querying of deep web sources, the laborious work of querying, navigating results, downloading, storing and tracking data changes is a burden on shoulders of users. To decrease this intensive labor work of accessing data, (semi-)automatic harvesters have a crucial role. However, they lack a number of functionalities that we discuss and address in this work.
In this thesis, we investigate the path towards a focused web harvesting approach which can automatically and efficiently query websites, navigate through results, download data, store it and track data changes over time. Such an approach can also facilitate users to access a complete collection of relevant data to their topics of interest and monitor it over time. To realize such a harvester, we focus on the following obstacles. First, we try to find methods that can achieve the best coverage in harvesting data for a topic. Although using a fully automatic general harvester facilitates accessing web data, it is not a complete solution to collect a thorough data coverage on a given topic. Some search engines, in both surface web and deep web, restrict the number of requests from a user or limit the number of returned results presented to him. We suggest an efficient approach which can pass these limitations and achieve a complete data coverage.
Second, we investigate reducing the cost of harvesting a website regarding the number of submitted requests by estimating its actual size. Harvesting tasks continue till they face the posed query submission limitations by search engines or consume all the allocated resources. To prevent this undesirable situation, we need to know the size of the targeted source. For a website that hides the true size of its residing data, we suggest an accurate method to estimate its size.
As the third challenge, we focus on monitoring data changes over time in web data repositories. This information is helpful in providing the most up-to-date answers to information needs of users. The fast evolving web adds extra challenges for having an up-to-date data collection. Considering the costly process of harvesting, it is important to find methods which facilitate efficient re-harvesting processes.
Lastly, we combine our experiences in harvesting with the studies in the literature to suggest a general designing and developing framework for a web harvester. It is important to know how to configure harvesters so that they can be applied to different websites, domains and settings.
These steps bring further improvements to data coverage and monitoring functionalities of web harvesters and can help users such as journalists, business analysts, organizations and governments to reach the data they need without requiring extreme software and hardware facilities. With this thesis, we hope to have contributed to the goal of focused web harvesting and monitoring topics over time.