2013’s DB colloquium – Djoerd Hiemstra

Below you find a list of last year's DB colloquia, usually Tuesday's from 13:45h. – 14:30h. in ZI-3126.

15 January 2013

Zhemin Zhu
Separate Training for Conditional Random Fields Using Co-occurrence Rate Factorization
The standard training method of Conditional Random Fields (CRFs) is very slow for large-scale applications. In this paper, we present separate training for undirected models based on the novel Co-occurrence Rate Factorization (CR-F). Separate training is a local training method. In contrast to piecewise training, separate training is exact. In contrast to MEMMs, separate training is unaffected by the label bias problem. Experiments show that separate training (i) is unaffected by the label bias problem; (ii) reduces the training time from weeks to seconds; and (iii) obtains competitive results to the standard and piecewise training on linear-chain CRFs.

18 January 2013 (Friday from 13.45h. to 15.30h. in Dutch)

Edo Drenthen (Avanade)
Big Data & Hadoop
BigData staat voor data die zo groot en complex is dat ze moeilijk zijn op te slaan en te bewerken met traditionele database management tools. Big Datasets verschillen van traditionele datasets over 3 assen: 1) Volume: We zijn meer dan ook in staat om data te genereren en op te slaan. Dit heeft tot gevolg dat datasets in orde en factor groter worden; 2) Variety: Data analyze is nodig over een combinatie van gestructureerd (bijv RDBMS) en ongestructureerd (bijv logs, tweets); 3) Velocity: Data wordt steeds sneller gegenereerd, bijvoorbeeld door mobile devices, sensor netwerken, logs, streaming systems. Het onderscheid tussen relevante en en niet relevante data is hierdoor voor bedrijven steeds moeilijker te maken. Organisaties staan voor de uitdaging om de juiste informatie uit Big Data te halen en deze om te zetten naar waarde. Afhankelijk van o.a doelstellingen en de maturity van een organisatie zal een strategie bedacht moeten worden om dit te bewerkstelligen. Avanade helpt haar klanten bij het beantwoorden van deze vragen. Deze presentatie zal gaan over BigData integratie en Hadoop op het Microsoft platform. Verschillende aspecten van BigData uiteengezet zullen worden naar aanleiding van een business-case en toegelicht worden door middel van een demo. Voor BI, Datawarehousing en data mining enthousiastelingen is dit een presentatie die je zeker niet mag missen!

Jasper Stoop
Process Mining and Fraud Detection:
A case study on the theoretical and practical value of using process mining for the detection of fraudulent behavior in the procurement process
The presentation is about how process mining can be utilized in fraud detection and what the benefits of using process mining for fraud detection are. The concepts of fraud and fraud detection are discussed. These results are combined with an analysis of existing case studies on the application of process mining and fraud detection to construct an initial setup of two case studies, in which process mining is applied to detect possible fraudulent behavior in the procurement process. Based on the experiences and results of these case studies, the 1+5+1 methodology is presented as a first step towards operationalizing principles with advice on how process mining techniques can be used in practice when trying to detect fraud.

23 January 2013 (Wednesday)

Ander de Keijzer (Windesheim University of Applied Sciences)
Securing Data
With computers everywhere and pretty much always online and/or accessible, securing data is of the utmost importance. In this presentation we highlight several points of interest when thinking of security and suggest some possible (and practical) solutions.

5 February 2013 (15:45h. – 17.30h. in Waaier 1&2)

Ed Brinksma
Try out Euclid
Prof. Brinksma has prepared an amazing lecture for next year's bachelor programme. He will take you to exciting corners of mathematics where you have never been before. But, above all, he will challenge you with some remarkable brain teasers. Solutions may be submitted live through your smartphone or tablet. Among the ten best solutions we will allot nine Spotify Gift Cards. The first prize will be a surprise.

5 March 2013

Robin Aly
A Unified Framework to Evaluate Plurality Oriented Retrieval
A standard way to evaluate search engines is the Cranfield paradigm. The paradigm assumes well-defined information needs, golden standard relevance assessments, and an evaluation measure on graded relevance. The literature believes that the Cranfield paradigm is a too poor model of reality and that better paradigms should include phenomena such as the following: first, users issue ambiguous queries, second, information needs require diverse documents, and finally, relevance assessments could be erroneous. Until now, each approach in the literature addresses only a subset of these phenomena. We propose, however, that these phenomena will almost always co-occur and an search evaluation approach should account for this. In this talk I present a unified framework of approaches from the literature that has exactly one component for each phenomena. Using the framework, we investigate how the comparison of existing search engines changes under a simulated mixture of these phenomena.

18 March 2013 (Monday at 15.45 h. in the SmartXP lab)

Peter Norvig (Google)
Norvig Web Data Science Award Ceremony
In a live video connection from California, USA, Peter Norvig, Director of Research at Google, will award the winners of the Web Data Science Award that was named in his honor: Lesley Wevers, Oliver Jundt, and Wanno Drijfhout from the University of Twente. The Norvig Web Data Science Award was created by Common Crawl and SURFsara to encourage research in web data science.
Read more…

26 March 2013

Vincent van Donselaar
TF×IDF ranking: a physical approach
This paper tries to find similarities and differences between TFxIDF ranking and the theory of network analysis. A simplified network model based on the principle of an electrical circuit acts as a guide to gain understanding of the model’s operation. The correctness of this model is tested by implementing it as a function of the Terrier Information Retrieval System, whereupon it is evaluated against Terrier’s predefined TF×IDF model.

2 April 2013

COMMIT day
This day is a follow up to the successful COMMIT Kick-Off meeting of last year. COMMIT/ted to you! is meant only for people who are officially participating in the COMMIT program.
Read more

9 April 2013

Paul Stapersma
Efficient Query Evaluation on Probabilistic XML Data
In this thesis, we instruct MayBMS to cope with probabilistic XML (P-XML) in order to evaluate XPath queries on P-XML data as SQL queries on uncertain relational data. This approach entails two aspects: (1) a data mapping from P-XML to U-Rel that ensures that the same information is represented by database instances of both data structures, and (2) a query mapping from XPath to SQL that ensures that the same question is specified in both query languages.

16 April 2013

Maurice van Keulen
Parels der Informatica (Pearls of Computer Science)
Maurice will talk about the plans for the new Computer Science bachelor curriculum, where student learn about the computer science's most impressive achievements: the pearls of Computer Science.

22 April 2013 (15:00 h. Atrium Ravelijn)

with Maurice van Keulen and Djoerd Hiemstra
TOM inspiration market
On Monday, 22nd April, from 15.00 – 17.00 (Atrium, Ravelijn), there will be an Inspiration market about Twente's Educational Model. You can walk in and out as you wish, there is no fixed schedule. Maurice and Djoerd will both be present at the market, talking about using digital white boards, online lectures, and challenges.
Read more (access restricted)

1 May 2013 (Wednesday)

Rezwan Huq
Inference-based Framework managing Data Provenance
Data provenance allows scientists to validate their model as well as to investigate the origin of an unexpected value. Furthermore, it can be used as a replication recipe of output data products. However, capturing provenance requires enormous effort by scientists in terms of time and training. First, they need to design the workflow of the scientific model, i.e., workflow provenance, which requires both time and training. Second, they need to capture provenance while the model is running, i.e., fine-grained data provenance. Explicit documentation of fine-grained provenance is not feasible because of the massive storage consumption by provenance data in the applications where data is continuously arriving and is processed. We propose an inference-based framework which provides both workflow and fine-grained data provenance at a minimal cost in terms of time, training and disk consumption. Our framework is applicable to any given scientific model and is capable of handling different system dynamics such as variation in the processing time as well as input data products arrival pattern. We evaluate the framework on two different use cases. Our evaluation shows that the proposed framework can infer accurate provenance information at reduced costs and therefore, is relevant and suitable for scientists in different domains.

27 May 2013 (Monday)

Mena Habib
University of Twente at #MSM2013
Twitter messages are a potentially rich source of continuously and instantly updated information. Shortness and informality of such messages are challenges for Natural Language Processing tasks. In this paper we present a hybrid approach for Named Entity Extraction (NEE) and Classification (NEC) for tweets. The system uses the power of the Conditional Random Fields (CRF) and the Support Vector Machines (SVM) in a hybrid way to achieve better results. For named entity type classification we used the AIDA disambiguation system to disambiguate the extracted named entities and hence find their type.

4 June 2013 (10.00h. in Waaier 4)

CTIT Symposium on Big Data
The CTIT Symposium Big Data and the emergence of Data Science addresses the multidisciplinary opportunities and challenges of Big Data from the perspectives of industrial research, academic research, and education.
Read more…

10 June 2013 (Monday)

Dan Ionita and Niek Tax
An API-based Search System for One Click Access to Information
This paper proposes a prototype One Click access system, based on previous work in the field and the related 1CLICK-2@NTCIR10 task. The proposed solution integrates methods from previous such attempts into a three tier algorithm: query categorization, information extraction and output generation and offers suggestions on how each of these can be implemented. Finally, a thorough user-based evaluation concludes that such an information retrieval system outperforms the textual preview collected from Google search results, based on a paired sign test. Based on validation results possible suggestions on future improvements are proposed.

18 June 2013

Design projects
Online leeromgeving voor derde wereld landen
We will present the result of our ‘Ontwerpproject’. The last semester, we have built a system that will help students in Third World Countries with their Master Thesis. These students will graduate at their own university, but can use our system to get in touch with supervisors from around the world to help them with their research. This way, the limitations on resources and expertises in those areas are dealt with to make sure the student can reach out to their potential. During this presentation, we will demonstrate the system and discuss some techniques we used to accomplish our goals.
Science Challenges
For the 'Ontwerpproject' we set out to build a system that will support the Challenge initiative. This initiative strives to organize and promote science challenges for students of the University of Twente. Challenges are short, engaging and fun projects – often combined with a competition – in which students can participate in teams or alone. The presentation will briefly describe the initiative, our workflow and the system that we built.

26 June 2013 (Wednesday)

Ilya Markov (Università della Svizzera italiana)
Reducing the Uncertainty in Resource Selection
The distributed retrieval process is plagued by uncertainty. Sampling, selection, merging and ranking are all based on very limited information compared to centralized retrieval. This talk will be focused on reducing the uncertainty within the resource selection phase by obtaining a number of estimates, rather than relying upon only one point estimate. Three methods for reducing uncertainty will be proposed, which will be compared against state-of-the-art baselines across three distributed retrieval testbeds. The experimental results show that the proposed methods significantly improve baselines, reduce the uncertainty and improve robustness of resource selection.

2 July 2013

Christian Wartena (Hochschule Hannover)
Distributional Similarity of Words with Different Frequencies
In this talk I will present three case studies in which we deal the frequency bias of distributional similarity. In the first study we compare context vectors of words with word distributions of documents in order to find words that can represent the content of a document. Here we explicitly require that the context vectors of the words are similar to the word distribution of the document and dissimilar to the general word distribution of the collection. In two recent studies we used distributional similarity to determine the semantic equivalence of words and short phrases. Here we eliminate the frequency bias by modeling the dependency of vector similarity on the word frequency.
We are not yet able to give a final solution to the frequency problem of distributional semantics, but the three case studies show that distributional similarity can be substantially improved when we take the frequency bias serious. Thus the topic is definitely worth investigating further

9 July 2013

Robin Aly
Taily: Shard Selection Using the Tail of Score Distributions
I propose Taily, a novel shard selection algorithm that models a query’s score distribution in each shard as a Gamma distribution and selects shards with highly scored documents in the tail of the distribution. Taily estimates the parameters of score distributions based on the mean and variance of the score function’s features in the collections and shards. Because Taily operates on term statistics instead of document samples, it is efficient and has deterministic effectiveness. Experiments on large web collections (Gov2, CluewebA and CluewebB) show that Taily achieves similar effectiveness to sample-based approaches, and improves upon their efficiency by roughly 20% in terms of used resources and response time. Read more