As of 1 July, I will leave the University of Twente after almost 30 years (first as a student, then as a PhD student, and finally as a staff member) for a new challenge at Radboud University in Nijmegen. I am proud to announce that I will join Radboud University's Faculty of Science as professor of Federated Search.
I have been privileged to teach in a world that has changed a lot since I became an assistant professor in 2001. Today, university-level courses are no longer taught only for the privileged few at universities in developed countries. They are freely available to anyone online via platforms like Coursera, edX and FutureLearn, and on social media such as YouTube. Over the last 18 years, I encouraged students to find additional study material online. In return, I tried to contribute to the online study material by publishing my teaching material for students to use and for colleagues to share (my Canvas courses are still entirely publicly available), and by using novel social media like UT Mastodon (https://mastodon.utwente.nl).
In my years at the UT, I enjoyed promoting critical thinking by letting students actively put theory into practice, instead of letting them passively absorb knowledge. I particularly enjoyed developing the MSc course Managing Big Data with Maarten Fokkinga and Robin Aly (later perfected by Doina Bucur), where students analysed terabytes of data on a large Hadoop cluster. I enjoyed developing the BSc module Data & Information with Klaas Sikkel, Maurice van Keulen and Luís Ferreira Pires, where we let students work in agile teams, including daily stand-ups, sprint review meetings, and sprint backlogs. I also very much liked running the MSc course Information Retrieval with Paul van der Vet, Theo Huibers and Dolf Trieschnigg, where students used open source search engines and actively contributed to our research. Some of that work was published, and in such cases, students presented their work at international workshops or conferences.
Saying goodbye to Twente is harder than I expected. But remember, Nijmegen is close by: feel free to contact me. As for PhD students, I intend to remain an active contributor to the courses of the Dutch research school SIKS; I hope to see you there.
To celebrate Peter Apers' retirement, we created The Apers Tree, which displays the academic genealogy of Peter Apers. The tree is inspired by the wonderful Mathematics Genealogy Project, and is a gift from the Database Group of the University of Twente on the occasion of Peter's retirement on 16 February 2018.
Check out the Apers Tree on GitHub.
by Wim van der Zijden, Djoerd Hiemstra, and Maurice van Keulen
We argue that there is a need for Multi-Tenant Customizable OLTP systems. Such systems need a Multi-Tenant Customizable Database (MTC-DB) as a backing. To stimulate the development of such databases, we propose the benchmark MTCB. Benchmarks for OLTP exist and multi-tenant benchmarks exist, but no MTC-DB benchmark exists that accounts for customizability. We formulate seven requirements for the benchmark: realistic, unambiguous, comparable, correct, scalable, simple and independent. The benchmark focuses on performance aspects and produces nine metrics: Aulbach compliance; size on disk; tenants created; types created; attributes created; transaction data type instances created per minute; transaction data type instances loaded by ID per minute; conjunctive searches per minute; and disjunctive searches per minute. We present a specification and an example implementation in Java 8, which can be accessed from the following public repository. The same repository also contains a naive implementation of an MTC-DB in which each tenant has its own schema. We believe that this benchmark is a valuable contribution to the community of MTC-DB developers, because it provides objective comparability as well as a precise definition of the concept of MTC-DB.
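The per-minute metrics all follow the same pattern: run one kind of operation repeatedly against the database under test for a fixed interval and extrapolate the rate. A minimal sketch of that pattern (the names here are illustrative, not MTCB's actual Java API):

```python
import time

def per_minute(operation, duration_seconds=10.0):
    """Run `operation` repeatedly for `duration_seconds` and
    extrapolate the count to operations per minute."""
    count = 0
    deadline = time.monotonic() + duration_seconds
    while time.monotonic() < deadline:
        operation()  # e.g. create one transaction data type instance
        count += 1
    return count * 60 / duration_seconds

# Example: measure a trivial stand-in operation; a real harness
# would issue an insert or lookup against the MTC-DB under test.
rate = per_minute(lambda: None, duration_seconds=0.1)
```

The same loop, parameterized with a different `operation`, would yield each of the throughput metrics (creates, loads by ID, conjunctive and disjunctive searches).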
The Multi-Tenant Customizable database Benchmark will be presented at the 9th International Conference on Information Management and Engineering (ICIME 2017) on 9-11 October 2017 in Barcelona, Spain.
The paper Evaluation and analysis of term scoring methods for term extraction, with Suzan Verberne, Maya Sappelli and Wessel Kraaij, has been selected as one of ACM Computing Reviews' 2016 Best of Computing. Computing Reviews is published by the Association for Computing Machinery (ACM); its editor-in-chief is Carol Hutchins (New York University).
In the paper, we evaluate five term scoring methods for automatic term extraction on four different types of text collections. We show that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose.
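To give an idea of what an unsupervised term scoring method looks like, here is a simple relative-frequency-ratio scorer, which ranks a term higher the more frequent it is in the domain collection relative to a background collection. This is only an illustration of the general idea, not necessarily one of the five methods evaluated in the paper:

```python
from collections import Counter

def term_scores(domain_tokens, background_tokens):
    """Score each domain term by the ratio of its relative frequency
    in the domain collection to its (smoothed) relative frequency
    in a background collection."""
    dom = Counter(domain_tokens)
    bg = Counter(background_tokens)
    n_dom, n_bg = len(domain_tokens), len(background_tokens)
    scores = {}
    for term, freq in dom.items():
        p_dom = freq / n_dom
        # add-one smoothing so unseen background terms get a score
        p_bg = (bg[term] + 1) / (n_bg + len(bg))
        scores[term] = p_dom / p_bg
    return scores

scores = term_scores(
    "query ranking query retrieval ranking".split(),
    "the cat sat on the mat query".split())
# "ranking" scores highest: frequent in the domain, absent
# from the background collection.
```

Domain-specific terms that rarely occur in general text rise to the top; common words are pushed down by their background frequency.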
Low latency asynchronous database synchronization and data transformation using the replication log
by Vincent van Donselaar
Analytics firm Distimo offers a web-based product that allows mobile app developers to track the performance of their apps across all major app stores. The Distimo backend system uses web scraping techniques to retrieve the market data, which is stored in the backend master database: the data warehouse (DWH). A batch-oriented program periodically synchronizes relevant data to the frontend database that feeds the customer-facing web interface.
The synchronization program suffers from limitations due to its batch-oriented design: the metadata that must be calculated before and after each batch results in overhead and increased latency. The goal of this research is to streamline the synchronization process by moving to a continuous, replication-like solution, combined with principles from the field of data warehousing. The binary transaction log of the master database feeds the synchronization program, which is also responsible for implicit data transformations like aggregation and metadata generation. In contrast to traditional homogeneous database replication, this design allows synchronization across heterogeneous database schemas. The prototype demonstrates that a composition of replication and data warehousing techniques can offer an adequate solution for robust and low latency data synchronization software.
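The core idea can be sketched as follows: tail the change events from the master's replication log and apply each one, transformed, to the frontend schema, so that aggregates stay current without batch bookkeeping. All names below are hypothetical illustrations, not Distimo's actual code:

```python
def synchronize(log_events, frontend):
    """Consume (table, row) change events from the master's
    replication log and apply them to the frontend database.

    `frontend` is any object with an `upsert(table, row)` method.
    Transformations (here, a running daily-download aggregate as an
    example of metadata generation) happen inline per event.
    """
    daily_totals = {}
    for table, row in log_events:
        if table == "downloads":
            # Transform: maintain an aggregate in a frontend table
            # whose schema differs from the master's raw table.
            key = (row["app_id"], row["date"])
            daily_totals[key] = daily_totals.get(key, 0) + row["count"]
            frontend.upsert("daily_downloads",
                            {"app_id": row["app_id"],
                             "date": row["date"],
                             "total": daily_totals[key]})
        else:
            frontend.upsert(table, row)  # pass through unchanged
    return daily_totals
```

Because each event is transformed as it arrives, the frontend lags the master only by the time needed to process one event, rather than by a full batch cycle.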
Development and Validation of a Personal Information Security Assistant Architecture
by Roeland Kegel
This thesis presents and validates the first iteration of the design process of a Personal Information Security Assistant (PISA). The PISA aims to protect the information and devices of end-users, offering advice and education in order to improve users' security and awareness. The PISA is a security solution that takes a user-centric approach, aiming to educate as well as protect, to motivate as well as secure. This thesis first presents the method by which stakeholders are elicited and classified, and its application. Requirements are then elicited using these stakeholders. Four architectural alternatives for the PISA are then proposed. Finally, these alternatives are validated by a traceability analysis, a prototype implementation of one alternative, and feedback from a focus group of experts. In summary, this thesis presents stakeholders, goals, requirements and proposed architectures for the PISA, and contains a validation of the latter.
by Djoerd Hiemstra, John Tait, Andrew MacFarlane, and Nick Belkin
Stephen Robertson was named a Fellow of the Association for Computing Machinery (ACM) last week. Robertson retired from the Microsoft Research Lab in Cambridge this year after a long career as one of the most influential, well liked and eminent researchers in Information Retrieval throughout the world. His successful career was celebrated in the latest BCS IRSG Informer. Stephen Robertson continues to be active in Information Retrieval in his retirement at University College London.
Today would have been the 41st birthday of Joost van Honschoten, who passed away almost two years ago. Joost was a talented young researcher, holding grants from STW and NWO, and working as a professor at the Transducers Science and Technology Group of the University of Twente. Joost and I published several “papers” together around 1983, not as researchers, but as comic book writers, when we were about 11 and 12 years old. One of them, “Honne & Ponnie en de Jacht op Ruige Robbie”, can be downloaded from the link below. The comic gives an idea of the friendship, creativity and humour that we shared.
Scalable identity extraction and ranking in Tracks Inspector
by Jop Hofste
The digital forensic world deals with a growing amount of data that must be processed. In general, investigators do not have the time to manually analyze all the digital evidence to get a good picture of a suspect, and investigations often contain multiple evidence units per case. This research shows how identities can be extracted and resolved from evidence data; investigators are supported in their investigations by proposing the involved identities to them. These identities are extracted from multiple heterogeneous sources, such as system accounts, emails, documents, address books and communication items. Identity resolution is used to merge identities at case level when multiple evidence units are involved.
The functionality for extracting, resolving and ranking identities is implemented and tested in the forensic tool Tracks Inspector. The implementation was tested on five datasets, and the results were compared with two other forensic products, Clearwell and Trident, on the extent to which they support identity functionality. Tracks Inspector delivers very promising results: it extracts at least as many of the relevant identities in its top 10 as Clearwell and Trident do. The tests also show that Tracks Inspector delivers high accuracy, with better precision than Clearwell and approximately equal recall.
The contribution of this research is a method for the extraction and ranking of identities in Tracks Inspector. In the digital forensic world this is quite a new approach, because no other software products support this kind of functionality. Investigations can now start by exploring the most relevant identities in a case; the nodes involved with an identity can be recognized quickly, which means that the evidence data can be filtered at an early stage.
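The resolution step described above can be sketched as merging identity records that share an identifier such as an email address, for example with a union-find over the addresses. This is a deliberately simplified illustration under the assumption that each record carries at least one email address; the actual matching in Tracks Inspector is more involved:

```python
def resolve_identities(identities):
    """Merge identity records (dicts with "name" and a non-empty
    "emails" list) that share at least one email address, using a
    union-find structure over the addresses."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link all addresses that occur together in one record.
    for ident in identities:
        emails = ident["emails"]
        for e in emails[1:]:
            union(emails[0], e)

    # Group records by the root address of their component.
    merged = {}
    for ident in identities:
        root = find(ident["emails"][0])
        group = merged.setdefault(root, {"names": set(), "emails": set()})
        group["names"].add(ident["name"])
        group["emails"].update(ident["emails"])
    return list(merged.values())
```

A record extracted from an address book and one extracted from email headers end up in the same merged identity as soon as they share a single address, which is what allows merging at case level across evidence units.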
Integrating a social network into an administration system for primary education
by Mark Kazemier
Research by the Dutch educational inspectorate shows that there are still many problems within Dutch primary education (Inspectie van het onderwijs, 2010). Topicus develops a pupil administration system, ParnasSys, that tries to solve these problems for primary education. However, two of these problems are not solved by ParnasSys: uncertified teachers and poor teaching material. With the recent increase in popularity of social networks, Topicus sees opportunities. This study shows that a social network should be integrated into ParnasSys as a stand-alone application: when users log in to ParnasSys, they get a new option to go to the social network, but the existing parts do not connect directly to the network.
Existing theory and implementations of social networks in education and corporations show that social networking creates new relationships between people that would otherwise not have existed. This leads to access to more information, new experiences, and the creation of new content. The creation of new content can help teachers select better teaching material, enhance their current teaching material, and find solutions to issues they currently face in the classroom. They can also share their own experiences, helping other teachers increase their skills and experience.
When integrating a social network within ParnasSys, two issues need to be mitigated: copyright and privacy. Copyright can easily be mitigated by automatically posting all content on the network under a Creative Commons attribution licence, which means that everyone can use the content as long as they mention the author. When copyrighted content is posted to the network, it can be removed when a takedown notice or report is received. Privacy is a more subtle issue: while privacy controls mitigate most of the problems, some persist. For example, when a teacher posts something about a pupil and the pupil's parent is also a teacher with access to ParnasSys, this could lead to issues. The only way to mitigate this is by educating users that these privacy issues exist.
It is recommended to integrate a social network within ParnasSys. There are two directions for further research. First, although this research recommends starting with the social network as a stand-alone application, it is worth investigating how existing parts of ParnasSys could connect to the network; for example, pages with information about tests could integrate with the network so that several users can work on those tests together. Second, finding information becomes more important as the network gains users. Although the interviews with users revealed no problems with finding information, this could become an issue in the future. It is therefore recommended to test several search methods and measure how many users use these methods to find the information they need.