Jop Hofste graduates on identity ranking in digital evidence data

Scalable identity extraction and ranking in Tracks Inspector

by Jop Hofste

The digital forensic world deals with a growing amount of data that must be processed. In general, investigators do not have the time to manually analyze all the digital evidence to get a good picture of the suspect. Most investigations involve multiple evidence units per case. This research demonstrates the extraction and resolution of identities from evidence data. Investigators are supported in their investigations by being presented with the identities involved. These identities are extracted from multiple heterogeneous sources such as system accounts, emails, documents, address books and communication items. Identity resolution is used to merge identities at case level when multiple evidence units are involved.

The functionality for extracting, resolving and ranking identities is implemented in the forensic tool Tracks Inspector and tested on five datasets. The results are compared with two other forensic products, Clearwell and Trident, on the extent to which they support the identity functionality. Tracks Inspector delivers very promising results compared to these products: its top 10 identities contain at least as many of the relevant identities as those of Clearwell and Trident. Tracks Inspector also delivers high accuracy: in the tests, its precision is better than Clearwell's and its recall is approximately equal.

The contribution of this research is a method for the extraction and ranking of identities in Tracks Inspector. In the digital forensic world this is a fairly new approach, because no other software products support this kind of functionality. Investigations can now start by exploring the most relevant identities in a case. The nodes that are involved in an identity can be recognized quickly. This means that the evidence data can be filtered at an early stage.
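To make the idea of case-level identity resolution concrete, here is a minimal sketch. It is not the method implemented in Tracks Inspector: the record fields (`unit`, `name`, `emails`), the rule "merge records that share an email address" and the ranking by number of evidence units are all assumptions chosen for illustration.

```python
from collections import defaultdict

def resolve_identities(records):
    """Merge identity records that share at least one normalised email address.

    `records` is a list of dicts with the hypothetical keys
    "unit" (evidence unit), "name" and "emails".
    Returns merged identities, ranked by the number of evidence units
    they appear in (an assumed ranking criterion).
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the representative, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Link every record to the first record that used the same email address.
    first_seen = {}
    for idx, rec in enumerate(records):
        for email in rec["emails"]:
            key = email.strip().lower()
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    # Collect the names, addresses and evidence units of each merged identity.
    merged = defaultdict(lambda: {"names": set(), "emails": set(), "units": set()})
    for idx, rec in enumerate(records):
        group = merged[find(idx)]
        group["names"].add(rec["name"])
        group["emails"].update(e.strip().lower() for e in rec["emails"])
        group["units"].add(rec["unit"])

    return sorted(merged.values(),
                  key=lambda g: (-len(g["units"]), sorted(g["emails"])))

# Toy data, invented for this example.
records = [
    {"unit": "laptop", "name": "J. Doe", "emails": ["jdoe@example.org"]},
    {"unit": "phone", "name": "John Doe",
     "emails": ["JDoe@example.org", "john@example.com"]},
    {"unit": "phone", "name": "Alice", "emails": ["alice@example.org"]},
]
identities = resolve_identities(records)
```

In this toy case the two "Doe" records are merged into one identity that spans both evidence units, so it is ranked above the identity that occurs in only one unit.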

[download pdf]

Size estimation of non-cooperative data collections

by Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen

In this paper, approaches for estimating the size of non-cooperative databases and search engines are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on the modification of the available techniques are introduced and evaluated. In one of the modifications, the estimations from other approaches could be improved by 35 to 65 percent.
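As background, one classic way to estimate the size of a collection that can only be sampled is capture-recapture. The sketch below shows the basic Lincoln-Petersen estimator; it is an illustration of the general idea and not necessarily one of the approaches compared in the paper. The function name and the sample data are assumptions.

```python
def lincoln_petersen(sample_a, sample_b):
    """Estimate the collection size from two independent samples of document IDs.

    Estimate = |A| * |B| / |A intersect B|, which assumes that both samples
    are drawn uniformly at random from the same collection.
    """
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("samples do not overlap; the estimate is undefined")
    return len(a) * len(b) / overlap

# Two toy samples of 100 document IDs each that share 50 documents.
estimate = lincoln_petersen(range(0, 100), range(50, 150))
```

Samples obtained through a search interface are generally not uniform, which is one reason practical estimators are more involved than this sketch.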

To be presented at the 14th International Conference on Information Integration and Web-based Applications and Services (iiWAS 2012) on 3-5 December 2012 in Bali, Indonesia.

[download pdf]

Ensemble clustering for result diversification

by Dong Nguyen and Djoerd Hiemstra

This paper describes the participation of the University of Twente in the Web track of TREC 2012. Our baseline approach uses the Mirex toolkit, an open source tool that sequentially scans all the documents. For result diversification, we experimented with improving the quality of clusters through ensemble clustering. We combined clusters obtained by different clustering methods (such as LDA and K-means) and clusters obtained by using different types of data (such as document text and anchor text). Our two-layer ensemble run performed better than the LDA-based diversification and also better than a non-diversification run.
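One common way to combine several clusterings into one is a co-association (evidence accumulation) consensus, sketched below. This is a generic illustration and not the two-layer ensemble used in the paper; the function names, the 0.5 threshold and the toy label vectors are assumptions.

```python
from itertools import combinations

def co_association(clusterings):
    """Fraction of clusterings in which each pair of items shares a cluster."""
    n_items = len(clusterings[0])
    scores = {}
    for i, j in combinations(range(n_items), 2):
        together = sum(1 for labels in clusterings if labels[i] == labels[j])
        scores[(i, j)] = together / len(clusterings)
    return scores

def consensus_clusters(clusterings, threshold=0.5):
    """Group items that are linked by a co-association above `threshold`."""
    n_items = len(clusterings[0])
    neighbours = {i: set() for i in range(n_items)}
    for (i, j), score in co_association(clusterings).items():
        if score > threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)

    # Connected components of the thresholded graph form the consensus clusters.
    seen, clusters = set(), []
    for start in range(n_items):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for other in neighbours[node] - seen:
                seen.add(other)
                stack.append(other)
        clusters.append(sorted(component))
    return clusters

# Three toy clusterings of four documents, e.g. from different methods or data types.
runs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
consensus = consensus_clusters(runs)
```

Cluster labels only need to be consistent within a single clustering, since the consensus looks at which documents are grouped together rather than at the label values.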

[download pdf]

Mark Kazemier graduates on social networks for primary education teachers

Integrating a social network into an administration system for primary education

by Mark Kazemier

Research by the Dutch educational inspectorate shows that there are still many problems within Dutch primary education (Inspectie van het onderwijs, 2010). Topicus creates a pupil administration system, ParnasSys, that tries to solve these problems for primary education. However, two of these problems are not solved by ParnasSys: teachers are uncertified and teaching material is often bad. With the recent increase in popularity of social networks, Topicus sees opportunities. This study shows that a social network should be integrated into ParnasSys as a stand-alone application. This means that when users log in to ParnasSys they get a new option to go to the social network, but the existing parts do not connect directly to the network.
Existing theory and implementations of social networks in education and corporations show that social networking creates new relationships between people that otherwise would not have existed. This leads to access to more information, new experience and the creation of new content. The creation of new content can help teachers to select better teaching material, enhance their current teaching material and find solutions to issues they currently have in the classroom. They can also share their own experiences with others, helping other teachers increase their skills and experience.
When integrating a social network within ParnasSys there are two issues that need to be mitigated: 1) copyright and 2) privacy. Copyright can easily be mitigated by automatically posting all content on the network under a Creative Commons Attribution license. This means that everyone can use the content as long as they credit the author. When people post copyrighted content to the network, it can be removed when a takedown notice or report is received. Privacy is a more subtle issue. While privacy controls mitigate most of the issues, some remain. For example, when a teacher posts something about a pupil and the parent of this pupil is also a teacher with access to ParnasSys, this could lead to problems. The only way to mitigate this is to educate users that such privacy issues exist.
It is recommended to integrate a social network within ParnasSys. There are two possibilities for further research. First, this research recommends integrating the social network as a stand-alone application to start with, but it is recommended to look further into possibilities to connect several existing parts of ParnasSys with the network. For example, pages with information about tests could be integrated with the network so that several users can work together on these tests. Second, finding information becomes more important as the network gains users. While the interviews with users revealed no issues with finding information, this could become an issue in the future. It is therefore recommended to test several search methods and measure how many users use each method to find the information they need.

[download pdf]

Study tour completed

We are back from the two-week China tour organized by Inter-Actief: 28 students, 4 cities (Shanghai, Hangzhou, Beijing and Hong Kong), 14 company and university visits in 14 days!

Visit at Tsinghua University

Top 3 university visits: 1) Tsinghua University in Beijing with a very warm welcome by prof. Ling Feng, excellent talks and an impressive campus tour; 2) Jiao Tong University in Shanghai with interesting talks and students demoing their design challenge results; 3) Tongji University, Shanghai with interesting presentations and a campus tour.

Visit at MSR Asia

Top 3 company visits: 1) Microsoft Research Asia in Beijing with an excellent welcome by Tetsuya Sakai, some really awesome tech talks and a cool tour through the lab (see team photo); 2) Philips Design, Hong Kong with interesting talks and some of us participating in an experiment; 3) MotionGlobal, Shanghai, with very inspiring talks and an 'international' office tour. Two runners-up worth mentioning: 4) Nedap in Shanghai, and 5) Alibaba in Hangzhou.

More information at: noodle2012.nl

Dutch-Belgian Database Day in Brussels

The Dutch-Belgian Database Day 2012 (DBDBD 2012) will be held in Brussels, Belgium on November 21st, 2012. The DBDBD is a yearly one-day workshop organized by a Belgian or Dutch university, whose general topic is database research. At the DBDBD, junior researchers from the Netherlands and Belgium can present their recent results, and meet senior researchers in the field of databases. DBDBD invites submissions (1 page abstract) on a broad range of database and database-related topics.

More information at: http://dbdbd.be.

Almer Tigelaar defends PhD thesis on P2P Search

Peer-to-peer information retrieval

by Almer Tigelaar

The Internet has become an integral part of our daily lives. However, the essential task of finding information is dominated by a handful of large centralised search engines. In this thesis we study an alternative to this approach. Instead of using large data centres, we propose using the machines that we all use every day: our desktop, laptop and tablet computers, to build a peer-to-peer web search engine. We provide a definition of the associated research field: peer-to-peer information retrieval. We examine what separates it from related fields, give an overview of the work done so far and provide an economic perspective on peer-to-peer search. Furthermore, we introduce our own architecture for peer-to-peer search systems, inspired by BitTorrent. Distributing the task of providing search results for queries introduces the problem of query routing: a query needs to be sent to a peer that can provide relevant search results. We investigate how the content of peers can be represented so that queries can be directed to the best ones in terms of relevance. While cooperative peers can provide their own representation, the content of uncooperative peers can be accessed only through a search interface and thus they cannot actively provide a description of themselves. We look into representing these uncooperative peers by probing their search interface to construct a representation. Finally, the capacity of the machines in peer-to-peer networks differs considerably, making it challenging to provide search results quickly. To address this, we present an approach where copies of search results for previous queries are retained at peers and used to serve future requests, and we show that participation can be incentivised using reputations. There are still problems to be solved before a real-world peer-to-peer web search engine can be built.
This thesis provides a starting point for this ambitious goal and also provides a solid basis for reasoning about peer-to-peer information retrieval systems in general.
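The query routing problem described above can be illustrated with a minimal sketch. It assumes that each peer is represented by a bag of term frequencies and that peers are scored by the summed relative frequency of the query terms; the scoring rule, the function name `route_query` and the peer data are assumptions for illustration, not the representations studied in the thesis.

```python
from collections import Counter

def route_query(query, peer_terms, top_k=2):
    """Return the ids of the `top_k` peers whose representation best matches the query.

    `peer_terms` maps a peer id to a Counter of term frequencies that
    stands in for the peer's content representation.
    """
    terms = query.lower().split()
    scores = {}
    for peer, counts in peer_terms.items():
        total = sum(counts.values())
        if total == 0:
            continue
        # Sum of the relative frequencies of the query terms at this peer.
        score = sum(counts.get(term, 0) / total for term in terms)
        if score > 0:
            scores[peer] = score
    # Highest score first; ties are broken by peer id for a stable order.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_k]

# Toy peer representations, invented for this example.
peers = {
    "peer_a": Counter({"jazz": 8, "music": 2}),
    "peer_b": Counter({"music": 5, "football": 5}),
    "peer_c": Counter({"football": 10}),
}
selected = route_query("jazz music", peers)
```

For an uncooperative peer, the term counts would have to be built from documents sampled by probing its search interface rather than supplied by the peer itself.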

[download pdf]

OLC-IT Annual Report 2011-2012

The IT programme committee (OLC-IT) deals with the examination regulations and the curriculum of the bachelor's programmes Technische Informatica and Telematica and the master's programmes Computer Science and Telematics. By law, it has the right to give solicited and unsolicited advice to the programme director and the dean. Each year the OLC writes an annual report. In this year's annual report:

  • Curriculum changes
  • Quality assurance
  • University-wide teaching and examination regulations (OER)
  • Measures to speed up study progress
  • Twente educational model
  • Interaction between students and lecturers

Read the full annual report 2011-2012.

What snippets say about web pages

What Snippets Say About Pages in Federated Web Search

by Thomas Demeester, Dong Nguyen, Dolf Trieschnigg, Chris Develder, and Djoerd Hiemstra

What is the likelihood that a Web page is considered relevant to a query, given the relevance assessment of the corresponding snippet? Using a new federated IR test collection that contains search results from over a hundred search engines on the internet, we are able to investigate such research questions from a global perspective. Our test collection covers the main Web search engines like Google, Yahoo!, and Bing, as well as a number of smaller search engines dedicated to multimedia, shopping, etc., and as such reflects a realistic Web environment. Using a large set of relevance assessments, we are able to investigate the connection between snippet quality and page relevance. The dataset is strongly inhomogeneous, and although the assessors’ consistency is shown to be satisfactory, care is required when comparing resources. To this end, a number of probabilistic quantities, based on snippet and page relevance, are introduced and evaluated.
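The opening question can be read as a conditional probability, P(page relevant | snippet judgment). A minimal sketch of estimating it from paired judgments follows; the binary labels, the function name and the toy judgments are assumptions for illustration, and the quantities defined in the paper may differ.

```python
def page_relevance_given_snippet(judgments):
    """Estimate P(page relevant | snippet label) from paired binary judgments.

    `judgments` is an iterable of (snippet_relevant, page_relevant) booleans.
    Returns a dict that maps each observed snippet label to the fraction of
    its pages that were judged relevant.
    """
    totals = {True: 0, False: 0}
    relevant = {True: 0, False: 0}
    for snippet_ok, page_ok in judgments:
        totals[snippet_ok] += 1
        if page_ok:
            relevant[snippet_ok] += 1
    return {label: relevant[label] / totals[label]
            for label in totals if totals[label] > 0}

# Toy judgments, invented for this example.
judgments = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False), (False, False),
]
estimates = page_relevance_given_snippet(judgments)
```

With these toy judgments, two of the three pages behind relevant snippets are relevant, against one of the four pages behind non-relevant snippets.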

The paper will be presented at the Asia Information Retrieval Societies Conference AIRS 2012 in Tianjin, China.

[download pdf]

CLEF 2012 Proceedings on-line

The proceedings of the Conference and Labs of the Evaluation Forum (CLEF 2012) are on-line at Springer, titled: “Information Access Evaluation meets Multilinguality, Multimodality, and Visual Analytics”. CLEF will take place from 17 to 20 September in Rome, Italy. The proceedings contain 14 full papers, 3 short papers, and 2 keynote papers to be presented at the conference in Rome. The papers are organized in three topical sections: benchmarking and evaluation initiatives; information access; and evaluation methodologies and infrastructure.

See: CLEF 2012 proceedings.