Dueling bandits for online ranker evaluation
by Masrour Zoghi
In every domain where a service or a product is provided, an important question is that of evaluation: given a set of possible choices for deployment, what is the best one? An important example, which is considered in this work, is that of ranker evaluation from the field of information retrieval (IR). The goal of IR is to satisfy the information need of a user in response to a query issued by them, where this information need is typically satisfied by a document (or a small set of documents) contained in what is often a much larger collection. This goal is often attained by ranking the documents according to their usefulness to the issued query using an algorithm called a ranker: a procedure that takes as input a query and a set of documents and specifies how the documents should be ordered.
This thesis is concerned with ranker evaluation. The goal of ranker evaluation is to determine the quality of the rankers under consideration to allow us to choose the best option: given a finite set of possible rankers, which one of them leads to the highest level of user satisfaction? There are two main methods for carrying this out: absolute metrics and relative comparisons. This thesis is concerned with the second, relative form of ranker evaluation because it is more efficient at distinguishing between rankers of different quality: for instance, interleaved comparisons take a fraction of the time required by A/B testing, but they produce the same outcome. More precisely, the problem of online ranker evaluation from relative feedback can be described as follows: given a finite set of rankers, choose the best one using only pairwise comparisons between the rankers under consideration, while minimizing the number of comparisons involving sub-optimal rankers. This problem is an instance of what is referred to in the literature as the dueling bandit problem.
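The interaction protocol described above can be sketched with a minimal round-robin baseline that duels random pairs, tracks pairwise wins, and returns the empirical Copeland winner (the ranker that beats the most others). This is only an illustration of the problem setting, not the CCB algorithm itself, and the preference matrix `p` is a made-up example:

```python
import random

def duel(i, j, p):
    """Simulate one pairwise comparison (e.g. an interleaved comparison):
    returns True if ranker i beats ranker j. The matrix p, with p[i][j]
    the probability that i wins, is unknown to the algorithm."""
    return random.random() < p[i][j]

def copeland_winner(wins, plays):
    """The empirical Copeland winner: the ranker whose observed win rate
    exceeds 1/2 against the largest number of other rankers."""
    k = len(wins)
    scores = [sum(1 for j in range(k)
                  if i != j and plays[i][j] > 0
                  and wins[i][j] / plays[i][j] > 0.5)
              for i in range(k)]
    return scores.index(max(scores))

def evaluate_rankers(k, p, rounds=2000, seed=0):
    """Naive uniform-exploration baseline: duel random pairs and return
    the empirical Copeland winner (CCB is far more sample-efficient)."""
    random.seed(seed)
    wins = [[0] * k for _ in range(k)]
    plays = [[0] * k for _ in range(k)]
    for _ in range(rounds):
        i, j = random.sample(range(k), 2)
        plays[i][j] += 1
        plays[j][i] += 1
        if duel(i, j, p):
            wins[i][j] += 1
        else:
            wins[j][i] += 1
    return copeland_winner(wins, plays)

# Made-up preference matrix: ranker 0 beats both others.
p = [[0.5, 0.7, 0.8],
     [0.3, 0.5, 0.6],
     [0.2, 0.4, 0.5]]
print(evaluate_rankers(3, p))  # → 0
```

The point of the dueling bandit formulation is to find this winner while wasting as few comparisons as possible on the sub-optimal rankers, which this naive baseline does not attempt.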
The main contribution of this thesis is devising a dueling bandit algorithm, called Copeland Confidence Bounds (CCB), that solves this problem under practically general assumptions and providing theoretical guarantees for its proper functioning. In addition to that, the thesis contains a number of other algorithms that are better suited for dueling bandit problems with particular properties.
by Wim van der Zijden
A good practice in business is to focus on key activities. For some companies this may be branding, while other businesses may focus on areas such as consultancy, production or distribution. Focusing on key activities means outsourcing as many other activities as possible. These other activities merely distract from the main goals of the company, and the company will not be able to excel in them.
Many companies are in need of reliable software to persistently process live data transactions and enable reporting on this data. To fulfil this need, they often have large IT departments in-house. Those departments are costly and distract from the company’s main goals. The emergence of cloud computing should make this no longer necessary. All they need is an internet connection and a service contract with an external provider.
However, most businesses need highly customizable software, because each company has slightly different business processes, even those in the same industry. So even if they outsource their IT needs, they will still have to pay expensive developers and business analysts to customize some existing application.
These issues are addressed by Multi-Tenant Customizable (MTC) applications. We define such an application as follows:
A single software solution that can be used by multiple organizations at the same time and which is highly customizable for each organization and user within that organization, by domain experts without a technical background.
A key challenge in designing such a system is to develop a proper persistent data storage, because mainstream databases are optimized for single-tenant usage. To this end, this Master’s thesis consists of two papers: the first paper proposes an MTC-DB benchmark, MTCB, which allows for objective comparison and evaluation of MTC-DB implementations, as well as providing a framework for the definition of MTC-DB. The second paper describes a number of MTC-DB implementations and uses the benchmark to evaluate them.
Does Google doubt whether the Holocaust happened?
Query autocompletion algorithms that are based on query logs are problematic in two important ways: 1) They return offensive and damaging results; 2) They suffer from a destructive feedback loop.
We are hiring two PhD positions on Maintenance Optimization for the Dutch railroads.
The Database and Formal Methods & Tools groups at the University of Twente seek two PhD candidates for SEQUOIA: Smart maintenance optimization via big data & fault tree analysis, a project funded by the Dutch Technology Foundation STW, and the companies ProRail and NS. ProRail is responsible for the Dutch railway network, including its construction, management, maintenance, and safety; NS has the same responsibility for the Dutch train fleet.
SEQUOIA aims to improve the reliability of the Dutch railroads by deploying big data analytics to predict and prevent failures. Its scientific core is a novel combination of machine learning, fault tree analysis, and stochastic model checking. The key idea is that big data analytics provide the statistics on failures, their correlations, dependencies, etc., while fault trees provide the domain knowledge needed to interpret these data. The project aims at fewer train disruptions and delays, lower maintenance costs, and more passenger comfort. It involves close cooperation with RWTH Aachen University and with various engineers from ProRail and NS. The PhD candidates will spend a portion of their time at the ProRail / NS sites in Utrecht.
The Dutch chapter of the Internet Society (ISOC) nominated Searsia for its 2017 Innovation Award.
Read more on the Searsia blog
by Djoerd Hiemstra and Robin Aly
SIKS/CBS DataCamp participants can download the answers for the Jupyter Scala/Spark notebook exercises below.
by Fieke Hillerström
Deep learning for biometrics has gained increasing attention over the last years. Due to the expansion of computational power and the increasing sizes of the available datasets, performance has surpassed that of humans on certain verification tasks. However, large datasets are not available for every application. Therefore we introduce Deep Verification Learning, which reduces network complexity and can be trained on more modest hardware with smaller datasets. Deep Verification Learning takes the two images to be verified at the input of a deep learning network and trains directly towards a verification score. This topology enables the network to learn differences and similarities in the first layer, and to involve verification signals during training. Directly training towards a verification score significantly reduces the number of trainable parameters. We applied Deep Verification Learning to the face verification task, though it could be extended to other biometric modalities. We compared our face verification learning topology with a network trained for multi-class classification on the FRGC dataset, which contains only 568 subjects. Deep Verification Learning performs substantially better.
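The claim that training directly towards a verification score cuts the number of trainable parameters can be made concrete with a back-of-the-envelope count of the final fully connected layer. The 512-dimensional feature size below is a hypothetical assumption, not a figure from the thesis; only the 568 subjects matches the FRGC set mentioned above:

```python
def dense_params(n_in, n_out, bias=True):
    """Number of trainable parameters in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

# Hypothetical 512-dimensional feature vector per image; 568 subjects
# as in the FRGC training set.
features, subjects = 512, 568

# Multi-class classification head: one output unit per subject.
cls_head = dense_params(features, subjects)

# Verification head: the two images' features feed one match score.
ver_head = dense_params(2 * features, 1)

print(cls_head, ver_head)  # → 291384 1025
```

The earlier layers are unaffected by this comparison; the saving comes entirely from replacing the per-subject output layer with a single verification output.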
by Joe Laufer, Mariëlle Winkler, Djoerd Hiemstra, and Susanne de Gooijer (Free Spirits UniTe)
Many employees of the University of Twente spend part of their free time on volunteer projects that are directly beneficial to society. They use their expertise and professional knowledge, for instance, by teaching children, by lecturing students in developing countries, by supporting the elderly with new technology, or by rewriting an NGO's strategic plan. These employees struggle to allocate enough free time for their volunteer work, whereas others might not need all their vacation days. Our proposal is simple: employees can take their vacation days and give them to employees who need them for their volunteer work. Our proposal extends the university's vision “the entrepreneurial university” to explicitly support projects with societal impact. We envision the following steps:
- Employees initiate projects that have societal impact and ask support of the campus community;
- Employees can donate one or more vacation days to become part of the community, and to form a pool of additional free time, to be used by the project initiators;
- Project initiators pitch their ideas to the community, similar to pitches on crowdfunding platforms like Kickstarter;
- The community can vote for the projects of their choice;
- Once enough vacation days are donated to an initiative, the project initiator can use the extra time off to carry out their initiative (in addition to the time that the employee already puts into their initiative);
- Students can participate in projects for credits (starting with building an on-line community platform);
- Alumni can sponsor initiatives financially, share their network, and coach the project initiators;
- Project initiators share their experience and accomplishments with the community, for instance by blogging about their project;
- Initiatives should be done in cooperation with an NGO.
We are proud that our proposal is accepted as one of the Living Smart Campus projects.
by Rutger Varkevisser
The internet is an incredible resource for information and learning. By using search engines like Google, information is usually just a click away. Unless you are a child, in which case most of the information on the web is either (way) too difficult to read and/or understand, or impossible to find. This research aims to combine the areas of readability assessment and gamification in order to provide a technical and theoretical foundation for the creation of an automatic, large-scale child-feedback readability assessment system, in which correctly assessing the readability level of online (textual) content for children is the central focus. The importance of having correct readability scores for online content is that they provide children with a guideline on the difficulty level of textual content on the web. They also allow external programs, e.g. search engines, to take readability scores into account based on the known age or proficiency of the user. Having children actively participate in the process of determining readability levels should improve current systems, which usually rely on fully automated algorithms or human (adult) perception.
The first step in the creation of the aforementioned tool is to make sure the underlying process is scientifically valid. This research adapts the Cloze-test as a method of determining the readability of a text. The Cloze-test is an established and well-researched method of readability assessment, which works by omitting certain words from a text and tasking the user with filling in the open spots with the correct words; the resulting overall score determines the readability level. For this research we want to digitize and automate this process. However, while the validity of the Cloze-test and its results in an offline (paper) environment has been proven, this is not the case for digital adaptations. Therefore the first part of this research focuses on this central issue. By combining the areas of readability assessment (the Cloze-test), gamification (the creation of a digital online adaptation of the Cloze-test) and child-computer interaction (a user test of the developed tool on the target audience), this validity was examined and tested. In the user test the participants completed several different Cloze-test texts, half of them offline (on paper) and the other half in a recreated online environment. This was done to measure the correlation between the online scores and the offline scores, which we already know are valid. Results of the user test confirmed the validity of the online version by showing significant correlations between the offline and online versions via both a Pearson correlation coefficient and a Spearman rank-order analysis.
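The paper-based procedure described above is straightforward to digitize. The sketch below uses the common fixed-ratio variant of the Cloze-test (omit every n-th word) together with exact-match scoring; it is an illustration of the mechanics, not the thesis implementation:

```python
def make_cloze(text, n=5, blank="____"):
    """Create a Cloze-test by omitting every n-th word.
    Returns the gapped text and the list of omitted answer words."""
    gapped, answers = [], []
    for i, word in enumerate(text.split(), start=1):
        if i % n == 0:
            answers.append(word)
            gapped.append(blank)
        else:
            gapped.append(word)
    return " ".join(gapped), answers

def score_cloze(answers, responses):
    """Exact-match scoring: the fraction of blanks filled correctly."""
    correct = sum(a.lower() == r.lower() for a, r in zip(answers, responses))
    return correct / len(answers)

text = "The quick brown fox jumps over the lazy dog near the river bank"
gapped, answers = make_cloze(text, n=4)
print(gapped)   # → The quick brown ____ jumps over the ____ dog near the ____ bank
print(answers)  # → ['fox', 'lazy', 'river']
```

Scoring variants exist (e.g. accepting synonyms instead of exact matches), which is one of the choices a digital adaptation has to make explicit.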
With the knowledge that the online adaptation of the Cloze-test is valid for determining readability scores, the next step was to automate the process of creating Cloze-tests from texts, given that the goal of the project was to provide the basis of a scalable gamified approach, and scalable in this context means automated. Several methods were developed to mimic the human process of creating a Cloze-test (i.e. looking at the text and selecting which words to omit given a set of general guidelines). These methods included TF.IDF and NLP approaches to find suitable extraction words for the purposes of a Cloze-test. They were tested by comparing the classification performance of each method against a baseline of manually classified/marked texts. The final versions of these methods achieved performance scores of around 50%, i.e. that is how well they emulated human performance in the creation of Cloze-tests. A combination of automated methods achieved a higher performance score of 63%. The best performing individual method was put to the test in a small Turing-test-style user test, which showed promising results: presented with two manually created and one automatically created Cloze-test, participants attained similar scores across all tests. Participants also gave contradictory responses when asked which of the three Cloze-tests was automated. This research concludes the following:
- Results of offline and online Cloze-tests are highly correlated.
- Automated methods are able to correctly identify 63% of suitable Cloze-test words as marked by humans.
- Users gave conflicting reports when asked to identify the automated test in a mix of automated and human-made Cloze-tests.
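The TF.IDF approach to selecting omission words mentioned above can be sketched as follows, using a toy background corpus. This is a simplified illustration of the idea (distinctive words score high and become omission candidates), not the thesis implementation:

```python
import math
from collections import Counter

def tokenize(text):
    return [w.lower().strip(".,") for w in text.split()]

def tfidf_candidates(doc, corpus, top_k=3):
    """Rank the words of `doc` by TF.IDF against a background corpus and
    return the top-k as candidate words to omit in a Cloze-test."""
    doc_sets = [set(tokenize(d)) for d in corpus]
    tf = Counter(tokenize(doc))
    n = len(corpus)
    scores = {}
    for word, freq in tf.items():
        df = sum(word in s for s in doc_sets)       # document frequency
        idf = math.log((n + 1) / (df + 1)) + 1      # smoothed IDF
        scores[word] = freq * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "a fox ran in the woods"]
print(tfidf_candidates("the fox jumped over the lazy dog", corpus))
# e.g. ['jumped', 'over', 'lazy']
```

Frequent function words like "the" appear in every background document and score low, so content words end up as the omission candidates, which mirrors what a human test author tends to do.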
We are proud of the Information Processing & Management Best Paper Award 2015 for our paper: A cross-benchmark comparison of 87 learning to rank methods.
Published in Information Processing and Management 51(6), pages 757–772