Fully automated web harvesting using a combination of new and existing heuristics
by Niels Visser
Several techniques exist for extracting useful content from web pages. However, the definition of 'useful' is very broad and context dependent. In this research, several techniques, both existing and new, are evaluated and combined in order to extract object data in a fully automatic way. The data sources used are mostly web shops, sites that promote housing, and vacancy sites. The data to be extracted from these pages are, respectively, items, houses and vacancies. Three kinds of approaches are combined and evaluated: clustering algorithms, algorithms that compare pages, and algorithms that look at the structure of single pages. Clustering is done in order to differentiate between pages that contain data and pages that do not. The algorithms that extract the actual data are then executed on the cluster that is expected to contain the most useful data. The quality measures used to assess the performance of the applied techniques are precision and recall per page. It can be seen that without proper clustering, the algorithms that extract the actual data perform very badly. Whether or not clustering performs acceptably depends heavily on the web site. For some sites, URL-based clustering stands out (for example: nationalevacaturebank.nl and funda.nl), with precisions of around 33% and recalls of around 85%. URL-based clustering is therefore the most promising clustering method reviewed in this research. Of the extraction methods, the existing methods perform better than the alterations proposed in this research. Algorithms that look at the structure (intra-page document structure) perform best of all four methods that are compared, with an average recall between 30% and 50%, and an average precision ranging from very low (around 2%) to quite low (around 33%). Template induction, an algorithm that compares pages, also performs relatively well; however, it is more dependent on the quality of the clusters.
The conclusion of this research is that it is not yet possible to extract data from websites fully automatically using a combination of the methods that are discussed and proposed.
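The abstract reports precision and recall per page. As a rough illustration of how such per-page scores could be computed and averaged, here is a minimal sketch. It assumes that extracted and expected values can be compared as exact strings; the function names and the sample data are hypothetical and are not taken from the thesis.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for a single page.

    Both arguments are collections of extracted values; exact string
    matching is an assumption made for this sketch.
    """
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


def average_per_page(pages):
    """Average precision and recall over (extracted, expected) pairs, one pair per page."""
    scores = [precision_recall(extracted, expected) for extracted, expected in pages]
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n


# Hypothetical pages: what an extractor returned versus what it should have returned.
pages = [
    (["price", "title", "menu", "footer"], ["price", "title", "address"]),
    (["title"], ["title", "price"]),
]
avg_precision, avg_recall = average_per_page(pages)
```

Averaging per page gives every page the same weight, so a page with many extracted fields does not dominate the score.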
Another thesis prize for Niek Tax: Best master thesis in computer science in 2014/2015 at the University of Twente, awarded by Alumni Association ENIAC. Photo: Niek Tax receives the award from Johan Noltes on behalf of the ENIAC jury. Congrats, Niek! Other nominees were Justyna Chromik (DACS), Vincent Bloemen (FMT), Maarten Brilman (HMI), Tim Paauw (IEBIS), and Moritz Müller (SCS).
by Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen
With the goal of harvesting all information about a given entity, in this paper, we try to harvest all matching documents for a given query submitted to a search engine. The objective is to retrieve all information about, for instance, “Michael Jackson”, “Islamic State”, or “FC Barcelona” from indexed data in search engines, or hidden data behind web forms, using a minimum number of queries. Policies of web search engines usually do not allow access to all of the search results that match a given query. They limit the number of returned documents and the number of user requests. These limitations also apply to deep web sources, for instance social networks like Twitter. In this work, we propose a new approach which automatically collects information related to a given query from a search engine, given the search engine’s limitations. The approach minimizes the number of queries that need to be sent by analysing the retrieved results and combining this analysed information with information from a large external corpus. The new approach outperforms existing approaches when tested on Google, measured by the total number of unique documents found per query.
To be presented at the 17th International Conference on Information Integration and Web-based Applications & Services on 11 – 13 December 2015 in Brussels, Belgium
Niek Tax was awarded today for his master thesis Scaling Learning to Rank to Big Data: Using MapReduce to Parallelise Learning to Rank by the Dutch association for ICT professionals and managers (Nederlandse beroepsvereniging van en voor ICT-professionals en -managers, Ngi-NGN). More information at Ngi-NGN and UT Nieuws. Congratulations, Niek!
Predicting relevance based on assessor disagreement: analysis and practical applications for search evaluation
by Thomas Demeester, Robin Aly, Djoerd Hiemstra, Dong Nguyen, and Chris Develder
Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions in terms of relevance of the results for general (e.g., future, unknown) users. In practice, however, most evaluation scenarios only allow us to conclusively determine the relevance towards the particular assessor that provided the judgments. A factor that cannot be ignored when extending conclusions made from assessors towards users is the possible disagreement on relevance, assuming that a single gold truth label does not exist. This paper presents and analyzes the predicted relevance model (PRM), which allows predicting a particular result’s relevance for a random user, based on an observed assessment and knowledge of the average disagreement between assessors. With the PRM, existing evaluation metrics designed to measure binary assessor relevance can be transformed into more robust and effectively graded measures that evaluate relevance towards a random user. It also leads to a principled way of quantifying multiple graded or categorical relevance levels for use as gains in established graded relevance measures, such as normalized discounted cumulative gain, which nowadays often use heuristic and data-independent gain values. Given a set of test topics with graded relevance judgments, the PRM allows evaluating systems on different scenarios, such as their capability of retrieving top results, or how well they are able to filter out non-relevant ones. Its use in actual evaluation scenarios is illustrated on several information retrieval test collections.
To be published in Information Retrieval Journal by Springer
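The idea behind the PRM in the abstract above, predicting relevance for a random user from an observed assessor label and the average disagreement, can be illustrated with a small sketch. This is not the estimator from the paper: it assumes a set of double-judged results is available, estimates the probability of user relevance per assessor label by simple counting, and plugs those probabilities into precision at k. All names and data are hypothetical.

```python
from collections import defaultdict


def estimate_predicted_relevance(judgment_pairs):
    """Estimate P(random user finds result relevant | assessor label).

    judgment_pairs: iterable of (assessor_label, user_says_relevant) tuples,
    taken from results judged by both an assessor and a second person
    (an assumption of this sketch).
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [relevant count, total count]
    for label, user_relevant in judgment_pairs:
        counts[label][0] += int(user_relevant)
        counts[label][1] += 1
    return {label: relevant / total for label, (relevant, total) in counts.items()}


def expected_precision_at_k(ranked_labels, predicted_relevance, k):
    """Precision@k where each binary label is replaced by its predicted relevance."""
    top = ranked_labels[:k]
    return sum(predicted_relevance[label] for label in top) / k


# Hypothetical double judgments: assessors and users disagree in 25% of the cases.
pairs = (
    [("rel", True)] * 3 + [("rel", False)]
    + [("nonrel", True)] + [("nonrel", False)] * 3
)
predicted = estimate_predicted_relevance(pairs)  # {"rel": 0.75, "nonrel": 0.25}
score = expected_precision_at_k(["rel", "nonrel", "rel", "rel"], predicted, k=4)  # 0.625
```

With binary assessor labels the same ranking would score a precision at 4 of 0.75. The softened value reflects that an assessor's "relevant" is not always a user's "relevant", and that a "non-relevant" result still satisfies some users.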
Check out the Jupyter IPython Notebook Exercises made for the module Web Science. The exercises closely follow the exercises from Chapters 13 and 14 of the wonderful Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley and Jon Kleinberg. Download the notebooks here:
Update (February 2016). The notebooks with answers are now available below:
Machine Learning Research at the University of Twente focusses on the application of Machine Learning in Social Signal Processing, Biometric Pattern Recognition, and Text Mining. Have a look at our new web site at: http://ml.ewi.utwente.nl.
Co-occurrence Rate Networks: Towards separate training for undirected graphical models
by Zhemin Zhu
Dependence is a universal phenomenon which can be observed everywhere. In machine learning, probabilistic graphical models (PGMs) represent dependence relations with graphs. PGMs find wide applications in natural language processing (NLP), speech processing, computer vision, biomedicine, information retrieval, etc. Many traditional models, such as hidden Markov models (HMMs) and Kalman filters, can be put under the umbrella of PGMs. The central idea of PGMs is to decompose (factorize) a joint probability into a product of local factors. Learning, inference and storage can be conducted efficiently over the factorization representation.
Two major types of PGMs can be distinguished: (i) Bayesian networks (directed graphs), and (ii) Markov networks (undirected graphs). Bayesian networks represent directed dependence with directed edges. Local factors of Bayesian networks are conditional probabilities. Directed dependence, directed edges and conditional probabilities are all asymmetric notions. In contrast, Markov networks represent mutual dependence with undirected edges. Both mutual dependence and undirected edges are symmetric notions. For general Markov networks, based on the Hammersley–Clifford theorem, local factors are positive functions over maximal cliques. These local factors are explained using intuitive notions like ‘compatibility’ or ‘affinity’. In particular, if a graph forms a clique tree, the joint probability can be reparameterized into a junction tree factorization.
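For reference, the three factorizations mentioned above can be written in their standard textbook form. The notation is an assumption of this sketch and is not taken from the thesis: $\mathrm{pa}(x_i)$ denotes the parents of $x_i$, $\psi_C$ a positive factor over clique $C$, $Z$ the normalisation constant, and $\mathcal{C}$ and $\mathcal{S}$ the cliques and separators of a junction tree.

```latex
% Bayesian network: product of conditional probabilities
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)

% Markov network (Hammersley--Clifford): product of positive clique factors
P(x_1, \dots, x_n) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),
\qquad
Z = \sum_{x_1, \dots, x_n} \prod_{C \in \mathcal{C}} \psi_C(x_C)

% Junction tree: clique marginals divided by separator marginals
P(x_1, \dots, x_n) = \frac{\prod_{C \in \mathcal{C}} P(x_C)}{\prod_{S \in \mathcal{S}} P(x_S)}
```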
In this thesis, we propose a novel framework motivated by the Minimum Shared Information Principle (MSIP): We try to find a factorization in which the information shared between factors is minimum. In other words, we try to make factors as independent as possible.
The benefit of doing this is that we can train factors separately without spending much effort on guaranteeing consistency between them. To achieve this goal, we develop a theoretical framework called co-occurrence rate networks (CRNs) to obtain such a factorization. Briefly, given a joint probability, the CRN factorization is obtained as follows. We first strip off singleton probabilities from the joint probability. The quantity left is called the co-occurrence rate (CR). CR is a symmetric quantity which measures mutual dependence among the variables involved. Then we further decompose the joint CR into smaller and independent CRs. Finally, we obtain a CRN factorization whose factors consist of all singleton probabilities and CR factors. There exist two kinds of independencies between these factors, where independent means that two factors do not share information: (i) a singleton probability is independent of other singleton probabilities; (ii) a CR factor is independent of other CR factors conditioned on singleton probabilities. Based on a CRN factorization, we propose an efficient two-step separate training method: (i) in the first step, we train a separate model for each singleton probability; (ii) in the second step, given the singleton probabilities, we train a separate model for each CR factor. Experimental results on three important natural language processing tasks show that our separate training method is two orders of magnitude faster than conditional random fields, while achieving competitive quality (often better on the overall quality metric F1).
The second contribution of this thesis is applying PGMs to a real-world NLP application: open relation extraction (ORE). In open relation extraction, two entities in a sentence are given, and the goal is to automatically extract their relation expression. ORE is a core technique, especially in the age of big data, for transforming unstructured information into structured data. We propose our model SimpleIE for this task. The basic idea is to decompose an extraction pattern into a sequence of simplification operations (components). The benefit of doing this is that these components can be re-combined in a new way to generate new extraction patterns. Hence SimpleIE can represent and capture diverse extraction patterns. This model is essentially a sequence labeling model. Experimental results on three benchmark data sets show that SimpleIE boosts recall and F1 by at least 15% compared with seven ORE systems.
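To make the idea of stripping off singleton probabilities concrete, the sketch below computes pairwise co-occurrence rates, CR(X; Y) = P(x, y) / (P(x) P(y)), for a small chain of three binary variables and checks that the joint probability is recovered from singleton probabilities and CR factors. The chain structure and all numbers are assumptions chosen for illustration; the code is not the implementation from the thesis.

```python
from itertools import product

# Hypothetical binary Markov chain X1 - X2 - X3 (illustrative numbers).
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

joint = {
    (a, b, c): p_x1[a] * p_x2_given_x1[a][b] * p_x3_given_x2[b][c]
    for a, b, c in product((0, 1), repeat=3)
}


def marginal(joint, keep):
    """Marginalise the joint onto the variable positions listed in keep."""
    out = {}
    for assignment, prob in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out


def co_occurrence_rate(joint, i, j):
    """CR(Xi; Xj) = P(xi, xj) / (P(xi) P(xj)) for every pair of values."""
    pair = marginal(joint, (i, j))
    p_i, p_j = marginal(joint, (i,)), marginal(joint, (j,))
    return {(a, b): p / (p_i[(a,)] * p_j[(b,)]) for (a, b), p in pair.items()}


singles = [marginal(joint, (i,)) for i in range(3)]
cr12 = co_occurrence_rate(joint, 0, 1)
cr23 = co_occurrence_rate(joint, 1, 2)


def crn_probability(a, b, c):
    """Rebuild P(a, b, c) from singleton probabilities and pairwise CR factors."""
    return (singles[0][(a,)] * singles[1][(b,)] * singles[2][(c,)]
            * cr12[(a, b)] * cr23[(b, c)])


# Largest difference between the original joint and the CRN reconstruction.
max_error = max(abs(crn_probability(*x) - joint[x]) for x in joint)
```

For a chain, the reconstruction is exact up to floating-point error because X1 and X3 are conditionally independent given X2. Each singleton and each CR factor is computed from its own marginal, which gives an intuition for why the factors can be estimated separately.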
As tangible outputs of this thesis, we contribute open source implementations of our research results as well as an annotated data set.
Estimating Creditworthiness using Uncertain Online Data
by Maurice Bolhuis
The rules for credit lenders have become stricter since the financial crisis of 2007-2008. As a consequence, it has become more difficult for companies to obtain a loan. Many people and companies leave a trail of information about themselves on the Internet. Searching and extracting this information is accompanied by uncertainty. In this research, we study whether this uncertain online information can be used as an alternative or extra indicator for estimating a company’s creditworthiness and how accounting for information uncertainty impacts the prediction performance.
A data set consisting of 3579 corporate ratings has been constructed using the data of an external data provider. Based on the results of a survey, a literature study and information availability tests, LinkedIn accounts of company owners, corporate Twitter accounts and corporate Facebook accounts were chosen as information sources for extracting indicators. In total, the Twitter and Facebook accounts of 387 companies and 436 corresponding LinkedIn owner accounts of this data set were manually searched. Information was harvested from these sources and several indicators were derived from the harvested information.
Two experiments were performed with this data. In the first experiment, Naive Bayes, J48, Random Forest and Support Vector Machine classifiers were trained and tested using solely these Internet features. A comparison of their accuracy to the 31% accuracy of the ZeroR classifier, which as a rule always predicts the most frequent target class, showed that none of the models performed statistically better. In a second experiment, it was tested whether combining Internet features with financial data increases the accuracy. A financial data mining model was created that approximates the rating model of the ratings in our data set and that uses the same financial data as the rating model. The two best performing financial models were built using the Random Forest and J48 classifiers, with an accuracy of 68% and 63% respectively. Adding Internet features to these models gave mixed results, with a significant decrease and an insignificant increase respectively.
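The ZeroR baseline mentioned above is simple enough to sketch in a few lines: it ignores all features and always predicts the most frequent class seen during training. The class name, labels and data below are hypothetical and only illustrate how such a baseline accuracy is obtained.

```python
from collections import Counter


class ZeroR:
    """Baseline classifier that always predicts the most frequent training label."""

    def fit(self, labels):
        # Remember the single most common label; features are ignored entirely.
        self.prediction_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n):
        """Return the stored majority label for n instances."""
        return [self.prediction_] * n


def accuracy(predicted, actual):
    """Fraction of predictions that equal the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical rating classes for a handful of companies.
train_labels = ["A", "B", "B", "C", "B", "A"]
test_labels = ["B", "A", "B", "C"]

model = ZeroR().fit(train_labels)
baseline_accuracy = accuracy(model.predict(len(test_labels)), test_labels)  # 0.5
```

Any model trained on real features has to beat this number by a statistically significant margin before its features can be said to carry predictive information.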
An experimental setup for testing how incorporating uncertainty affects the prediction accuracy of our model is explained. As part of this setup, a search system is described to find candidate results of online information related to a subject and to classify the degree of uncertainty of this online information. It is illustrated how uncertainty can be incorporated into the data mining process.
We are proud to announce the 12th Seminar on Searching and Ranking, with guest presentations by Ingo Frommholz from the University of Bedfordshire, UK, and Tom Heskes from Radboud University Nijmegen, the Netherlands.
More information at: SSR 12.