MTCB: A Multi-Tenant Customizable database Benchmark

by Wim van der Zijden, Djoerd Hiemstra, and Maurice van Keulen

We argue that there is a need for Multi-Tenant Customizable OLTP systems. Such systems need a Multi-Tenant Customizable Database (MTC-DB) as a backing. To stimulate the development of such databases, we propose the benchmark MTCB. Benchmarks for OLTP exist and multi-tenant benchmarks exist, but no MTC-DB benchmark exists that accounts for customizability. We formulate seven requirements for the benchmark: realistic, unambiguous, comparable, correct, scalable, simple and independent. The benchmark focuses on performance aspects and produces nine metrics: Aulbach compliance, size on disk, tenants created, types created, attributes created, transaction data type instances created per minute, transaction data type instances loaded by ID per minute, conjunctive searches per minute and disjunctive searches per minute. We present a specification and an example implementation in Java 8, which can be accessed from the following public repository. The same repository contains a naive implementation of an MTC-DB in which each tenant has its own schema. We believe that this benchmark is a valuable contribution to the community of MTC-DB developers, because it provides objective comparability as well as a precise definition of the concept of MTC-DB.
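To make the measured operations concrete, the sketch below models one tenant's customizable schema in memory and shows the kinds of operations the metrics count: creating types, creating instances, loading by ID, and conjunctive (all conditions hold) versus disjunctive (any condition holds) search. It is an illustration only, not the MTCB specification or the Java 8 implementation; the class `TenantStore`, its method names and the example attributes are all hypothetical.

```python
def matches_all(instance, criteria):
    """Conjunctive match: every attribute condition must hold."""
    return all(instance.get(attr) == value for attr, value in criteria.items())


def matches_any(instance, criteria):
    """Disjunctive match: at least one attribute condition must hold."""
    return any(instance.get(attr) == value for attr, value in criteria.items())


class TenantStore:
    """Hypothetical in-memory stand-in for a single tenant's private schema."""

    def __init__(self):
        self.types = {}      # type name -> set of attribute names
        self.instances = {}  # type name -> list of attribute dicts

    def create_type(self, name, attributes):
        # Each tenant defines its own types and attributes (customizability).
        self.types[name] = set(attributes)
        self.instances[name] = []

    def create_instance(self, type_name, values):
        unknown = set(values) - self.types[type_name]
        if unknown:
            raise KeyError(f"unknown attributes: {sorted(unknown)}")
        self.instances[type_name].append(dict(values))
        # The list position doubles as the instance ID in this sketch.
        return len(self.instances[type_name]) - 1

    def load_by_id(self, type_name, instance_id):
        return self.instances[type_name][instance_id]

    def search(self, type_name, criteria, conjunctive=True):
        match = matches_all if conjunctive else matches_any
        return [i for i in self.instances[type_name] if match(i, criteria)]
```

A benchmark run in this spirit would time how many such calls complete per minute for each operation; the linear scan in `search` is deliberately naive.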

The Multi-Tenant Customizable database Benchmark will be presented at the 9th International Conference on Information Management and Engineering (ICIME 2017) on 9-11 October 2017 in Barcelona, Spain.

[download pdf]

Term Extraction paper in Computing Reviews’ Best of 2016

The paper Evaluation and analysis of term scoring methods for term extraction, written with Suzan Verberne, Maya Sappelli and Wessel Kraaij, has been selected as one of ACM Computing Reviews' 2016 Best of Computing. Computing Reviews is published by the Association for Computing Machinery (ACM) and the editor-in-chief is Carol Hutchins (New York University).

In the paper, we evaluate five term scoring methods for automatic term extraction on four different types of text collections. We show that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose.

[download pdf]

Vincent van Donselaar graduates on database synchronization

Low latency asynchronous database synchronization and data transformation using the replication log

by Vincent van Donselaar

Analytics firm Distimo offers a web-based product that allows mobile app developers to track the performance of their apps across all major app stores. The Distimo backend system uses web scraping techniques to retrieve the market data, which is stored in the backend master database: the data warehouse (DWH). A batch-oriented program periodically synchronizes relevant data to the frontend database that feeds the customer-facing web interface.
The synchronization program poses limitations due to its batch-oriented design. The relevant metadata that must be calculated before and after each batch results in overhead and increased latency. The goal of this research is to streamline the synchronization process by moving to a continuous, replication-like solution, combined with principles seen in the field of data warehousing. The binary transaction log of the master database is used to feed the synchronization program, which is also responsible for implicit data transformations like aggregation and metadata generation. In contrast to traditional homogeneous database replication, this design allows synchronization across heterogeneous database schemas. The prototype demonstrates that a composition of replication and data warehousing techniques can offer an adequate solution for robust and low-latency data synchronization software.
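The idea of feeding a synchronization program from the replication log, and transforming data on the way, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis prototype: the event format (`op`, `row`, `old`, `new`) and the columns `app_id` and `downloads` are hypothetical, and real binary logs need a decoder that is not shown here.

```python
from collections import defaultdict


def apply_event(totals, event):
    """Apply one row-level change to the aggregated target (per-app download totals)."""
    op = event["op"]
    if op == "insert":
        totals[event["row"]["app_id"]] += event["row"]["downloads"]
    elif op == "update":
        # Undo the old row's contribution, then add the new one.
        totals[event["old"]["app_id"]] -= event["old"]["downloads"]
        totals[event["new"]["app_id"]] += event["new"]["downloads"]
    elif op == "delete":
        totals[event["row"]["app_id"]] -= event["row"]["downloads"]
    else:
        raise ValueError(f"unsupported operation: {op!r}")


def synchronize(log, totals=None):
    """Consume decoded log events in order and keep the aggregate up to date."""
    totals = defaultdict(int) if totals is None else totals
    for event in log:
        apply_event(totals, event)
    return totals
```

Because each event updates the aggregate incrementally, no metadata has to be recomputed per batch, and the target (a table of totals) need not share the schema of the source rows.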

[download pdf]

Roeland Kegel graduates on developing a personal information security assistant

Development and Validation of a Personal Information Security Assistant Architecture

by Roeland Kegel

This thesis presents and validates the first iteration of the design process of a Personal Information Security Assistant (PISA). The PISA aims to protect the information and devices of an end-user, offering advice and education in order to improve the security and awareness of its users. The PISA is a security solution that takes a user-centric approach, aiming to educate as well as protect, to motivate as well as secure. This thesis first presents the method by which stakeholders are elicited and classified, together with its application. Requirements are then elicited from these stakeholders. Four architectural alternatives for PISA are then proposed. Finally, these alternatives are validated by a traceability analysis, a prototype implementation of a specific alternative and feedback from a focus group of experts. In summary, this thesis presents stakeholders, goals, requirements and proposed architectures for the PISA and contains a validation of the latter.

[download pdf]

Celebrating Stephen Robertson’s Retirement

by Djoerd Hiemstra, John Tait, Andrew MacFarlane, and Nick Belkin

Stephen Robertson was named fellow of the Association for Computing Machinery (ACM) last week. Robertson retired from the Microsoft Research Lab in Cambridge this year after a long career as one of the most influential, well-liked and eminent researchers in Information Retrieval throughout the world. His successful career was celebrated in the latest BCS IRSG Informer. Stephen Robertson continues to be active in Information Retrieval in his retirement at University College London.

[download pdf]

In memory of Joost van Honschoten

Today would have been the 41st birthday of Joost van Honschoten, who passed away almost two years ago. Joost was a talented young researcher, holding grants from STW and NWO, working as a professor at the Transducers Science and Technology Group of the University of Twente. Joost and I published several “papers” together around 1983, not as researchers, but as comic book writers when we were about 11 and 12 years old. One of them, “Honne & Ponnie en de Jacht op Ruige Robbie”, can be downloaded from the link below. The comic gives an idea of the friendship, creativity and humour that we shared.

Honne en Ponnie en de Jacht op Ruige Robbie
HonneEnPonnieDeel1.pdf

Jop Hofste graduates on identity ranking in digital evidence data

Scalable identity extraction and ranking in Tracks Inspector

by Jop Hofste

The digital forensic world deals with a growing amount of data that has to be processed. In general, investigators do not have the time to manually analyze all the digital evidence to get a good picture of the suspect. Most investigations contain multiple evidence units per case. This research shows the extraction and resolution of identities from evidence data. Investigators are supported in their investigations by being presented with the identities involved. These identities are extracted from multiple heterogeneous sources like system accounts, emails, documents, address books and communication items. Identity resolution is used to merge identities at case level when multiple evidence units are involved.

The functionality for extracting, resolving and ranking identities is implemented in the forensic tool Tracks Inspector and tested on five datasets. The results are compared with two other forensic products, Clearwell and Trident, on the extent to which they support the identity functionality. Tracks Inspector delivers very promising results compared to these products: it extracts as many or more of the relevant identities in its top 10 than Clearwell and Trident do. Tracks Inspector also delivers high accuracy: the tests show that, compared to Clearwell, its precision is better and its recall is approximately equal.

The contribution of this research is to show a method for the extraction and ranking of identities in Tracks Inspector. In the digital forensic world this is a fairly new approach, because no other software products support this kind of functionality. Investigations can now start by exploring the most relevant identities in a case. The nodes that are involved in an identity can be quickly recognized. This means that the evidence data can be filtered at an early stage.

[download pdf]

Mark Kazemier graduates on social networks for primary education teachers

Integrating a social network into an administration system for primary education

by Mark Kazemier

Research by the Dutch educational inspectorate shows that there are still many problems within Dutch primary education (Inspectie van het onderwijs, 2010). Topicus develops ParnasSys, a pupil administration system that tries to solve these problems for primary education. Two of these problems, however, are not solved by ParnasSys: teachers are uncertified and teaching material is often poor. With the recent increase in popularity of social networks, Topicus sees opportunities. This study shows that a social network should be integrated into ParnasSys as a stand-alone application. This means that when users log in to ParnasSys they get a new option to go to the social network, but the existing parts do not connect directly to the network.
Existing theory and implementations of social networks in education and corporations show that social networking creates new relationships between people that otherwise would not have existed. This leads to access to more information, new experiences and the creation of new content. The creation of new content can help teachers to select better teaching material, enhance their current teaching material and find solutions to issues they currently have in the classroom. They can also share their own experiences with others, helping other teachers increase their skills and experience.
When integrating a social network within ParnasSys, there are two issues that need to be mitigated: 1) copyright and 2) privacy. Copyright can easily be mitigated by automatically posting all content on the network under a Creative Commons Attribution license. This means that everyone can use the content as long as they mention the author. When people post copyrighted content to the network, it can be removed when a takedown notice or report is received. Privacy is a more subtle issue. While privacy controls mitigate most of the issues, some remain. For example, when a teacher posts something about a pupil and the parent of this pupil is also a teacher with access to ParnasSys, this could lead to problems. The only way to mitigate this issue is by educating the users that those privacy issues exist.
It is recommended to integrate a social network within ParnasSys. There are two possibilities for further research. First, this research recommends integrating the social network as a stand-alone application to start with, but it is recommended to look further into possibilities to connect several existing parts of ParnasSys with the network. For example, pages with information on tests could integrate with the network so that several users can work together on these tests. Second, finding information becomes more important as the network gains more users. While the interviews with users revealed no issues with finding information, this could become an issue in the future. It is therefore recommended to test several search methods and measure how many users use these methods to find the information they need.

[download pdf]

ACM SIGIR honors Norbert Fuhr

For pioneering contributions to approaches that now dominate the search industry, ACM SIGIR honors Norbert Fuhr from the University of Duisburg-Essen (Germany) with the 2012 Gerard Salton Award. Fuhr developed probabilistic retrieval models for databases and XML, and his research on probabilistic models anticipated the current interest in learning-to-rank approaches in search operations. Fuhr received the award at the ACM SIGIR Conference in Portland, Oregon, USA, where he gave the opening keynote address. Read more in the ACM Press release.