2014 – Page 2 – Djoerd Hiemstra

Seminar on Cyberbullying

On Friday 12 September, the 11th SIKS/Twente Seminar on Searching and Ranking (SSR) takes place, on the topic of cyberbullying. The goal of the seminar is to bring together researchers from academia and organizations working on strategies and solutions to understand, detect and prevent cyberbullying incidents among adolescents. Invited speakers are:

Prof. Debra Pepler (York University, Canada)
Prof. Veronique Hoste (Ghent University, Belgium)

More information at: SSR-11.

Designing a Deep Web Harvester

by Mohammadreza Khelghati, Maurice van Keulen, and Djoerd Hiemstra

To make deep web data accessible, harvesters have a crucial role. Targeting different domains and websites requires a general-purpose harvester that can be applied to different settings and situations. Developing such a harvester requires addressing a large number of issues. To capture all influential elements in one picture, this paper introduces a new concept, the harvestability factor (HF). The HF is defined as an attribute of a website (HFW) or a harvester (HFH) representing the extent to which the website can be harvested or the harvester can harvest. These factors are composed of features of websites and harvesters, gathered from the literature or identified through the authors' experiments. In addition to enabling designers to evaluate where their products stand from the harvesting perspective, the HF can act as a framework for designing harvesters: designers can define the list of features and prioritize their implementation. To validate the effectiveness of the HF in practice, the paper shows how the HF elements can be applied to categorize deep websites and how this is useful in designing a harvester. To validate the HFH as an evaluation metric, it shows how the HFH can be calculated for the harvester implemented by the authors. The results show that the developed harvester works well for the targeted test set, with a score of 14.8 out of 15.

To be presented in Riva del Garda, Trentino, Italy, at the Workshop on Surfacing the Deep and the Social Web (SDSW 2014), a workshop co-located with the 13th International Semantic Web Conference.

Roeland Kegel graduates on developing a personal information security assistant

Development and Validation of a Personal Information Security Assistant Architecture

by Roeland Kegel

This thesis presents and validates the first iteration of the design process of a Personal Information Security Assistant (PISA). The PISA aims to protect the information and devices of end users, offering advice and education in order to improve their security and awareness. The PISA is a security solution that takes a user-centric approach, aiming to educate as well as protect, to motivate as well as secure. This thesis first presents the method by which stakeholders are elicited and classified, together with its application. Requirements are then elicited from these stakeholders. Four architectural alternatives for the PISA are then proposed. Finally, these alternatives are validated by a traceability analysis, a prototype implementation of one specific alternative, and feedback from a focus group of experts. In summary, this thesis presents stakeholders, goals, requirements and proposed architectures for the PISA and contains a validation of the latter.

[download pdf]

Databases: Last chance

TOM or MOOC?

This year, students can pass our course “Gegevensbanken” (Dutch for “Databases”) by following the massive open online course DB 2014 from Stanford University.

In 2013/2014 we offered our course “Gegevensbanken” (Dutch for “Databases”) for the last time. From this year on, students will study Databases as part of the module Data and Information. We advise students who failed the course last year, and who do not want to enroll in the Twente Educational Model (TOM: Twents Onderwijs Model), to follow the excellent Stanford DB 2014 massive open online course (MOOC) by Jennifer Widom.

The course takes the following 8 lectures (mini courses) from the full Stanford Database class. Students should finish roughly one mini course each week in Quarter 1.

Students who submit their Statements of Accomplishment via Blackboard for each mini course (except the Introduction mini course) will get 0.2 bonus grade points per Statement, or 1.5 bonus grade points if all Statements of Accomplishment are submitted. There are two exams for “Gegevensbanken”:

30 October, 13.45 h. – 16.45 h.
30 January, 8.45 h. – 11.45 h. (resit)

Students who fail both exams can enroll in the new Twente Educational Model bachelor module “Data & Information”, which takes place in Quarter 4.

Linear Co-occurrence Rate Networks for Sequence Labeling

by Zhemin Zhu, Djoerd Hiemstra, and Peter Apers

Sequence labeling has wide applications in natural language processing and speech processing. Popular sequence labeling models suffer from some known problems. Hidden Markov models (HMMs) are generative models and they cannot encode transition features; conditional Markov models (CMMs) suffer from the label bias problem; and training of conditional random fields (CRFs) can be expensive. In this paper, we propose Linear Co-occurrence Rate Networks (L-CRNs) for sequence labeling, which avoid the mentioned problems with existing models. The factors of L-CRNs can be locally normalized and trained separately, which leads to a simple and efficient training method. Experimental results on real-world natural language processing data sets show that L-CRNs reduce the training time by orders of magnitude while achieving results that are very competitive with CRFs.
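
The idea of factors that can be estimated separately can be illustrated with a small sketch. It assumes that the co-occurrence rate of two adjacent labels is defined as CR(a; b) = P(a, b) / (P(a) P(b)), which is not spelled out in the abstract above, and it estimates that quantity from label counts alone, ignoring the observations that the actual model conditions on. The function name `cooccurrence_rates` and the toy tag sequences are hypothetical, and this is not the authors' C++ implementation.

```python
from collections import Counter


def cooccurrence_rates(sequences):
    """Estimate CR(a; b) = P(a, b) / (P(a) * P(b)) for adjacent labels.

    Assumption: probabilities are plain relative frequencies of single
    labels and of adjacent label pairs, with no smoothing.
    """
    unigrams = Counter()
    bigrams = Counter()
    for seq in sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))  # adjacent label pairs

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    rates = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        rates[(a, b)] = p_ab / (p_a * p_b)
    return rates


# Hypothetical part-of-speech label sequences (D = determiner, N = noun, V = verb).
toy_sequences = [["D", "N", "V"], ["D", "N"]]
rates = cooccurrence_rates(toy_sequences)
print(rates)
```

A rate above 1 means the two labels occur next to each other more often than they would if they were independent. Each rate depends only on local counts, which is why such factors can be estimated one at a time rather than through a single global optimization.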

[download pdf]

The paper will be presented at the International Conference on Statistical Language and Speech Processing (SLSP) in Grenoble, France, on October 14-16, 2014.

Our C++ implementation of L-CRNs and the datasets used in this paper can be found on GitHub.

Tesfay Aregay graduates on Ranking Factors for Web Search

Ranking Factors for Web Search: Case Study in the Netherlands

by Tesfay Aregay

It is essential for search engines to constantly adjust their ranking functions to satisfy their users. At the same time, SEO companies and SEO specialists try to keep track of the factors prioritized by these ranking functions. In this thesis, the problem of identifying highly influential ranking factors for better ranking on search engines is examined in detail, looking at two different approaches currently in use and their limitations. The first approach is to calculate a correlation coefficient (e.g. Spearman rank) between a factor and the rank of its corresponding webpages (ranked documents in general) on a particular search engine. The second approach is to train a ranking model on datasets using machine learning techniques and to select the features that contributed most to a better-performing ranker. We present results that show whether or not combining the two approaches to feature selection can lead to a significantly better set of factors that improve the rank of webpages on search engines. We also provide results showing that calculating correlation coefficients between the values of ranking factors and a webpage's rank gives stronger results if the dataset contains a combination of the top few and bottom few ranked pages. In addition, lists of the ranking factors that contribute most to well-ranking webpages are provided for the Dutch web dataset (our case study) and the LETOR dataset.
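
The first approach can be sketched in a few lines. The example below computes a Spearman rank correlation between one factor and the result position, using only the Python standard library. The factor (number of backlinks), the data values and the function names are made up for illustration and are not taken from the thesis.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: backlink counts of the pages at result positions 1 to 5.
positions = [1, 2, 3, 4, 5]
backlinks = [900, 400, 450, 120, 30]

rho = spearman(backlinks, positions)
print(round(rho, 3))  # -0.9
```

Because position 1 is the best rank, a strongly negative coefficient here means that pages with more backlinks tend to appear higher in the results.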

[download pdf]

Photo by @Indenty.

Evaluate FedWeb runs online

The TREC Federated Web track provides a new online tool that checks the syntax of your runs and provides preliminary evaluation results on 10 of the 75 provided topics. Now you can easily see how you compare to other runs submitted to the system. The official TREC evaluation results will be based on at least 50 of the remaining topics in your run. Check your run at:
http://circus.ewi.utwente.nl/fedweb/

Please note that the site does NOT submit runs to TREC. Submit your runs to TREC via the TREC active participants site before August 18, 2014 (Resource & Vertical Selection) and before September 15, 2014 (Results Merging).

Follow @TRECFedWeb on Twitter.

Comparison of Local and Global Undirected Graphical Models

by Zhemin Zhu, Djoerd Hiemstra, Peter Apers, and Andreas Wombacher

Conditional Random Fields (CRFs) are discriminative undirected models which are globally normalized. Global normalization protects CRFs from the label bias problem (LBP), which most local models suffer from. Recently proposed co-occurrence rate networks (CRNs) are also discriminative undirected models. In contrast to CRFs, CRNs are locally normalized. It was established that CRNs are immune to the LBP although they are local models. In this paper, we further compare these two models. A connection between CRNs and copulas is established for the continuous case. Their strengths and weaknesses are also evaluated statistically through experiments.

[download pdf]

The paper was presented at the 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN) in Bruges (Belgium) on 23-25 April 2014.

Eight questions about Big Data

Everyone has been talking about Big Data lately. But what is it, really? Why is it such a big deal? And how can you handle Big Data responsibly? Eight questions about Big Data, together with Peter-Paul Verbeek, Elmer Lastdrager, Oscar Olthoff and Floris Kreiken.

The Importance of Prior Probabilities for Entry Page Search

by Wessel Kraaij, Thijs Westerveld, and Djoerd Hiemstra

An important class of searches on the world-wide web aims to find an entry page (homepage) of an organisation. Entry page search is quite different from Ad Hoc search. Indeed, a plain Ad Hoc system performs disappointingly. We explored three non-content features of web pages: page length, number of incoming links and URL form. The URL form in particular proved to be a good predictor. Using URL form priors we found over 70% of all entry pages at rank 1, and up to 89% in the top 10. Non-content features can easily be embedded in a language model framework as a prior probability.
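
How a prior enters a language model ranking can be shown with a minimal sketch: the document score is log P(D) plus the sum of the log probabilities of the query terms under a smoothed document model. The sketch assumes linear-interpolation smoothing with a collection model, a four-way URL classification (root, subroot, path, file) and invented prior values. None of these numbers come from the paper, and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical prior P(entry page | URL form); the values are made up.
URL_PRIOR = {"root": 0.60, "subroot": 0.20, "path": 0.15, "file": 0.05}


def url_form(url):
    """Classify a URL by the shape of its path (simplified assumption)."""
    rest = url.split("://", 1)[-1]       # drop the scheme, if any
    _, _, path = rest.partition("/")     # text after the host name
    if path == "":
        return "root"                    # e.g. http://example.org/
    if path.endswith("/"):
        # one directory level is a subroot, deeper levels are a path
        return "subroot" if path.count("/") == 1 else "path"
    return "file"                        # anything ending in a file name


def score(query, doc_tokens, url, collection, collection_len, lam=0.5):
    """log P(D) + sum over query terms of log P(term | D), smoothed.

    Assumes every query term occurs at least once in the collection,
    otherwise the logarithm is undefined.
    """
    tf = Counter(doc_tokens)
    log_p = math.log(URL_PRIOR[url_form(url)])
    for term in query:
        p_doc = tf[term] / len(doc_tokens)
        p_col = collection[term] / collection_len
        log_p += math.log(lam * p_doc + (1 - lam) * p_col)
    return log_p


# Two hypothetical pages with identical text but different URL forms.
docs = {
    "http://example.org/news/2014/item.html": ["acme", "corporation", "home"],
    "http://example.org/": ["acme", "corporation", "home"],
}
collection = Counter(t for tokens in docs.values() for t in tokens)
collection_len = sum(collection.values())
query = ["acme", "home"]

ranking = sorted(
    docs,
    key=lambda u: score(query, docs[u], u, collection, collection_len),
    reverse=True,
)
print(ranking[0])  # http://example.org/
```

Since the two pages have the same content, their content scores are equal and the prior alone decides the order, which puts the root URL first.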

[download pdf]

The paper was published at SIGIR 2002 and received an Honourable Mention for the ACM SIGIR Test of Time award at the 37th Annual ACM SIGIR Conference on Research and Development in Information Retrieval in Gold Coast, Australia, on 9 July 2014.