Jeroen Vonk graduates on Bisimulation reduction with MapReduce

by Jeroen Vonk

Within the field of Computer Science, a lot of past and current research is devoted to model checking. Model checking allows researchers to simulate a process or system, and to exhaustively test for wanted or unwanted properties. Naturally, the results of these tests are only as dependable as the degree to which the model represents the actual system. The best model, then, would represent the system down to its last atom, allowing for every possible interaction with the model. Such a model, of course, becomes extremely large, a situation known as state space explosion. Current research therefore focuses on:

  • Storing larger models
  • Processing large models faster and smarter
  • Reducing the size of models, whilst keeping the same properties

In this thesis we focus on reducing the size of models using bisimulation reduction. Bisimulation reduction identifies similar states that can be merged while preserving certain properties of the model. These similar, or redundant, states are identified by comparing them with other states in the model using a bisimulation relation: states that show the same behavior according to this relation can be merged. A common method to determine the smallest model is partition refinement. In order to use the algorithm on large models it needs to be scalable. Therefore we use MapReduce, the framework for distributed processing that is part of Hadoop. Using this framework provides us with a robust system that automatically recovers from, for example, hardware faults. The use of MapReduce also makes our algorithm scalable and easy to execute on third-party clusters.
During our experiments we saw that a MapReduce job takes a relatively long time to execute. We have estimated a startup cost of circa 30 seconds per job. This means that reducing transition systems that need many iterations can take very long. Extreme cases such as vasy_40_60, which takes over 20,000 iterations, could therefore not be benchmarked within an acceptable time frame. In each iteration all of our data is passed over the disk. It is therefore not unreasonable to see a factor 10-100 slowdown compared to an MPI-based implementation (e.g. LTSmin). From our experiments we conclude that the separate iteration times of our algorithm scale linearly up to 10^8 transitions for strong bisimulation and 10^7 transitions for branching bisimulation. On larger models the iteration time increases exponentially; we were therefore not able to benchmark our largest model.
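To illustrate the idea (not the thesis's actual implementation), the sketch below expresses one partition-refinement iteration for strong bisimulation in MapReduce style, with the map, shuffle and reduce phases simulated in memory; all function and variable names are our own.

```python
# Illustrative sketch of one partition-refinement iteration for strong
# bisimulation, written in a MapReduce style (map/shuffle/reduce simulated
# in memory). Not the thesis's actual code.
from collections import defaultdict

def map_transition(src, action, dst, block):
    """Map phase: for each transition, emit the target's current block,
    keyed by the source state."""
    yield src, (action, block[dst])

def reduce_state(state, values, block):
    """Reduce phase: the signature of a state is the set of
    (action, target block) pairs it can reach in one step."""
    return state, (block[state], frozenset(values))

def refine_once(transitions, block):
    # Shuffle: group the mapped values by source state.
    grouped = defaultdict(list)
    for src, action, dst in transitions:
        for key, value in map_transition(src, action, dst, block):
            grouped[key].append(value)
    # Reduce: states with the same (old block, signature) share a new block.
    new_ids, new_block = {}, {}
    for state in block:  # include states without outgoing transitions
        _, key = reduce_state(state, grouped.get(state, []), block)
        new_block[state] = new_ids.setdefault(key, len(new_ids))
    return new_block

# Tiny example LTS: 0 --a--> 1 --b--> 2 --b--> 2; states 1 and 2 are bisimilar.
transitions = [(0, 'a', 1), (1, 'b', 2), (2, 'b', 2)]
blocks = {0: 0, 1: 0, 2: 0}
while True:  # refine until the partition is stable (block count stops growing)
    refined = refine_once(transitions, blocks)
    if len(set(refined.values())) == len(set(blocks.values())):
        break
    blocks = refined
print(blocks)  # {0: 0, 1: 1, 2: 1}
```

Each such iteration corresponds to one MapReduce job, which is why the per-job startup cost dominates for transition systems that need many iterations.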

[download pdf]

Solving the Continuous Cold Start Problem in E-commerce Recommendations

Beyond Movie Recommendations: Solving the Continuous Cold Start Problem in E-commerce Recommendations

by Julia Kiseleva, Alexander Tuzhilin, Jaap Kamps, Melanie Mueller, Lucas Bernardi, Chad Davis, Ivan Kovacek, Mats Stafseng Einarsen, Djoerd Hiemstra

Many e-commerce websites use recommender systems or personalized rankers to personalize search results based on users' previous interactions. However, a large fraction of users has no prior interactions, making it impossible to use collaborative filtering or to rely on user history for personalization. Even the most active users may visit only a few times a year and may have volatile needs or different personas, making their personal history a sparse and noisy signal at best. This paper investigates how, when we cannot rely on the user history, the large-scale availability of other user interactions still allows us to build meaningful profiles from contextual data, and whether such contextual profiles are useful to customize the ranking, exemplified by data from Booking.com, a major online travel agent.
Our main findings are threefold: First, we characterize the Continuous Cold Start Problem (CoCoS) from the viewpoint of typical e-commerce applications. Second, as explicit situational context is not available in typical real-world applications, implicit cues from transaction logs used at scale can capture essential features of situational context. Third, contextual user profiles can be created offline, resulting in a set of smaller models compared to a single huge non-contextual model, making contextual ranking available with a negligible CPU and memory footprint. Finally, we conclude that, in an online A/B test on live users, our contextual ranker increased user engagement substantially over a non-contextual baseline, with the click-through rate (CTR) increasing by 20%. This clearly demonstrates the value of contextual user profiles in a real-world application.
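As an illustration of the offline contextual-profile idea (the context features, log format and blending rule below are ours, not Booking.com's), a minimal sketch could aggregate click feedback per context segment and use the resulting small models to adjust a baseline ranking:

```python
# Minimal sketch: aggregate click feedback per context segment offline, then
# use the small per-context models to rerank results online. Field names and
# the scoring blend are illustrative assumptions.
from collections import defaultdict

def context_key(log_entry):
    # Implicit situational context derived from the transaction log.
    return (log_entry["device"], log_entry["visitor_country"])

def build_contextual_profiles(log):
    """Offline step: one small click-through model per context segment."""
    clicks = defaultdict(lambda: defaultdict(int))
    views = defaultdict(lambda: defaultdict(int))
    for entry in log:
        ctx = context_key(entry)
        views[ctx][entry["item"]] += 1
        if entry["clicked"]:
            clicks[ctx][entry["item"]] += 1
    return {ctx: {item: clicks[ctx][item] / v
                  for item, v in views[ctx].items()}
            for ctx in views}

def rerank(items, ctx, profiles, baseline_score):
    """Online step: blend the baseline ranker with the contextual CTR model."""
    model = profiles.get(ctx, {})
    return sorted(items,
                  key=lambda i: baseline_score(i) + model.get(i, 0.0),
                  reverse=True)

log = [{"device": "mobile", "visitor_country": "NL", "item": "hotel_a", "clicked": True},
       {"device": "mobile", "visitor_country": "NL", "item": "hotel_b", "clicked": False}]
profiles = build_contextual_profiles(log)
print(rerank(["hotel_a", "hotel_b"], ("mobile", "NL"), profiles, lambda i: 0.5))
```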

[download pdf]

Marco Schultewolter graduates on Verification of User Information

by Marco Schultewolter

Often, software providers ask users to enter personal data in order to grant them the right to use their software. These companies want the user profile to be as correct as possible, but users sometimes tend to enter incorrect information. This thesis researches and discusses approaches to automatically verify this information using third-party web resources.
To this end, a series of experiments is conducted. One experiment compares different similarity measures in the context of a German phone book directory, for several different search approaches. Another experiment uses a search engine without a specific predefined data source; here, ways of finding persons in search engines and of extracting address information from unknown websites are compared.
It is shown that automatic verification can be done to some extent. The verification of name and address data using external web resources can support the decision when Jaro-Winkler is used as the similarity measure, but it is still not solid enough to rely on exclusively. Extracting address information from unknown pages is very reliable when using a sophisticated regular expression. Finding persons on the internet is best done using just the full name, without any additions.
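Since Jaro-Winkler comes out as the most useful similarity measure, a self-contained implementation may help make the result concrete; the example strings and the usual 0.1 prefix weight are illustrative, and any acceptance threshold remains application-specific.

```python
# Self-contained Jaro-Winkler similarity, the measure the thesis found most
# useful for matching name and address data. The 0.1 prefix scaling factor is
# the common default; acceptance thresholds are application-specific.
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):  # count matching characters within the window
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0  # count matched characters that are out of order
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions / 2
    m = matches
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):  # common prefix, at most 4 characters
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(jaro_winkler("MARTHA", "MARHTA"))  # ~0.961, the classic textbook example
print(jaro_winkler("Schultewolter", "Schultewalter"))
```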

[download pdf]

Steven Verkuil graduates on Reference Extraction Techniques

Journal Citation Statistics for Library Collections using Document Reference Extraction Techniques

by Steven Verkuil

Providing access to journals often comes with a considerable subscription fee for universities. It is not always clear how these journal subscriptions actually contribute to ongoing research. This thesis presents a multistage process for evaluating which journals are actively referenced in publications. Our software tool for journal citation reports, CiteRep, is designed to aid decision-making processes by providing statistics about the number of times a journal is referenced in a document set. Citation reports are automatically generated from online repositories containing PDF documents. The process of extracting citations and identifying journals is user- and maintenance-friendly. CiteRep allows generated reports to be filtered by year, faculty and study, providing detailed insight into journal usage for specific user groups. Our software tool achieves an overall weighted precision and recall of 66.2% when identifying journals in a fresh set of PDF documents. While leaving open some areas of improvement, CiteRep outperforms the two most popular citation parsing libraries, ParsCit and FreeCite, with respect to journal identification accuracy. CiteRep should be considered for the creation of journal citation reports from document repositories.
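As a rough illustration of the kind of pipeline CiteRep implements (not its actual code), the sketch below maps extracted reference strings to a small, made-up list of known journal titles and aggregates citation counts:

```python
# Illustrative sketch only: identify journals in extracted reference strings
# and aggregate counts into a report. The matching heuristic and journal list
# are invented for this example.
from collections import Counter
import re

KNOWN_JOURNALS = ["Information Retrieval", "Journal of the ACM",
                  "Information Processing & Management"]

def identify_journal(reference):
    """Return the known journal whose title occurs in the reference string."""
    cleaned = re.sub(r"\s+", " ", reference).lower()
    for journal in KNOWN_JOURNALS:
        if journal.lower() in cleaned:
            return journal
    return None

def citation_report(references):
    counts = Counter()
    for ref in references:
        journal = identify_journal(ref)
        if journal is not None:
            counts[journal] += 1
    return counts.most_common()

refs = ["J. Doe. Ranking models. Information   Retrieval, 12(3), 2009.",
        "A. Smith. Parsing citations. Journal of the ACM, 50(1), 2003.",
        "B. Lee. Another reference without a known journal, 2010."]
print(citation_report(refs))  # [('Information Retrieval', 1), ('Journal of the ACM', 1)]
```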

[download pdf]

Clone CiteRep on GitHub.

Mohammad Khelghati defends PhD thesis on Deep Web Entity Monitoring

by Mohammadreza Khelghati

Data is one of the keys to success. Whether you are a fraud detection officer in a tax office, a data journalist or a business analyst, your primary concern is to access all the data relevant to your topics of interest. In such an information-thirsty environment, accessing every source of information is valuable. This emphasizes the role of the web as one of the biggest and most important sources of data. In accessing web data through either general search engines or direct querying of deep web sources, the laborious work of querying, navigating results, downloading, storing and tracking data changes is a burden on the shoulders of users. To reduce this labor-intensive work of accessing data, (semi-)automatic harvesters play a crucial role. However, they lack a number of functionalities that we discuss and address in this work.
In this thesis, we investigate the path towards a focused web harvesting approach which can automatically and efficiently query websites, navigate through results, download data, store it and track data changes over time. Such an approach can also help users to access a complete collection of data relevant to their topics of interest and to monitor it over time. To realize such a harvester, we focus on the following obstacles. First, we try to find methods that can achieve the best coverage in harvesting data for a topic. Although using a fully automatic general harvester facilitates access to web data, it is not a complete solution for collecting thorough data coverage on a given topic. Some search engines, on both the surface web and the deep web, restrict the number of requests from a user or limit the number of returned results presented to them. We suggest an efficient approach which can bypass these limitations and achieve complete data coverage.
Second, we investigate reducing the cost of harvesting a website, in terms of the number of submitted requests, by estimating its actual size. Harvesting tasks continue until they hit the query submission limitations imposed by search engines or consume all the allocated resources. To prevent this undesirable situation, we need to know the size of the targeted source. For a website that hides the true size of the data it holds, we suggest an accurate method to estimate that size.
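As an illustration of size estimation in general, a classic capture-recapture (Lincoln-Petersen) estimate can be computed from two overlapping samples of query results; this is a textbook baseline, not necessarily the estimator proposed in the thesis:

```python
# Capture-recapture (Lincoln-Petersen) size estimate from two samples of a
# website's documents. A textbook baseline for illustration, not necessarily
# the thesis's estimator; the sample data are invented.
def estimate_size(sample_a, sample_b):
    """Estimate the number of documents in a source from two samples of
    document identifiers (e.g. URLs returned for random queries)."""
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no overlap between samples; take larger samples")
    return len(a) * len(b) / overlap

# Two samples of 400 and 500 URLs that share 40 URLs suggest about 5000 documents.
sample_a = [f"doc{i}" for i in range(0, 400)]
sample_b = [f"doc{i}" for i in range(360, 860)]
print(round(estimate_size(sample_a, sample_b)))  # 5000
```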
As the third challenge, we focus on monitoring data changes over time in web data repositories. This information is helpful in providing the most up-to-date answers to the information needs of users. The fast-evolving web adds extra challenges for keeping a data collection up to date. Considering the costly process of harvesting, it is important to find methods which enable efficient re-harvesting.
Lastly, we combine our harvesting experience with studies from the literature to suggest a general framework for designing and developing a web harvester. It is important to know how to configure harvesters so that they can be applied to different websites, domains and settings.
These steps bring further improvements to data coverage and monitoring functionalities of web harvesters and can help users such as journalists, business analysts, organizations and governments to reach the data they need without requiring extreme software and hardware facilities. With this thesis, we hope to have contributed to the goal of focused web harvesting and monitoring topics over time.

[download pdf]

A new search engine for the university

As of today, the university is using our Distributed Search approach as its main search engine on http://utwente.nl/search (and also stand-alone on https://search.utwente.nl). The UT search engine offers its users not only results from a large web crawl, but also live results from many sources that were previously invisible, such as courses, timetables, staff contact information, publications, the local photo database “Beeldbank”, vacancies, etc. The search engine combines about 30 such sources, and learns over time which sources should be included for a query, even if it has never seen that query, nor the results for the query.
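As a toy illustration of the resource-selection idea (the production system is considerably more sophisticated, and the sources and click data below are invented), one could learn from past clicks which sources are useful for which query terms and select sources for unseen queries accordingly:

```python
# Toy illustration of query-based resource selection for federated search:
# learn from past clicks which sources help for which query terms, then pick
# sources for a new query via shared terms. Sources and clicks are made up.
from collections import defaultdict

class SourceSelector:
    def __init__(self):
        self.term_source_clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, query, source):
        """Learning step: a click on a result from `source` for `query`."""
        for term in query.lower().split():
            self.term_source_clicks[term][source] += 1

    def select(self, query, k=3):
        """Score sources for a (possibly unseen) query by summing click
        evidence over its terms, and return the top-k sources."""
        scores = defaultdict(float)
        for term in query.lower().split():
            for source, clicks in self.term_source_clicks[term].items():
                scores[source] += clicks
        return [s for s, _ in sorted(scores.items(),
                                     key=lambda x: x[1], reverse=True)[:k]]

selector = SourceSelector()
selector.record_click("database course rooster", "timetables")
selector.record_click("database course", "courses")
selector.record_click("hiemstra publications", "publications")
print(selector.select("information retrieval course"))  # e.g. ['timetables', 'courses']
```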

University of Twente

Read more in the official announcement (in Dutch).

Efficient Web Harvesting Strategies for Monitoring Deep Web Content

by Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen

Focused Web Harvesting aims at achieving a complete harvest of a set of related web data for a given topic. Whether you are a fan following your favourite artist, athlete or politician, or a journalist investigating a topic, you need to access all the information relevant to your topics of interest and keep it up to date over time. General search engines like Google apply different techniques to enhance the freshness of their crawled data. However, in Focused Web Harvesting, we lack an efficient approach that detects changes in the content for a given topic over time. In this paper, we focus on techniques that allow us to keep the content relevant to a given entity up to date. To do so, we introduce approaches to efficiently harvest all the new and changed documents matching a given entity by querying a web search engine. One of our proposed approaches outperforms the baseline and the other approaches in finding the changed content on the web for a given entity, with an average performance improvement of at least 20 percent.
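The paper's contribution lies in choosing which queries to send; purely as an illustration of the surrounding change-detection step (with invented URLs and content), one could compare content fingerprints between two harvests to separate new from changed documents:

```python
# Illustration of the change-detection step only: compare content fingerprints
# of two harvests of an entity to separate new from changed documents.
# URLs and texts are invented; the paper's focus is on query selection.
import hashlib

def fingerprint(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def diff_harvests(previous, current):
    """previous/current: dicts mapping URL -> document text."""
    prev_fp = {url: fingerprint(text) for url, text in previous.items()}
    new = [url for url in current if url not in prev_fp]
    changed = [url for url, text in current.items()
               if url in prev_fp and fingerprint(text) != prev_fp[url]]
    return new, changed

old = {"http://example.org/a": "first version", "http://example.org/b": "stable"}
latest = {"http://example.org/a": "second version",
          "http://example.org/b": "stable",
          "http://example.org/c": "brand new page"}
print(diff_harvests(old, latest))  # (['http://example.org/c'], ['http://example.org/a'])
```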

[download pdf]

The software for this work is available as: HaverstED.

3TU NIRICT theme Data Science

The main objective of the NIRICT research in Data Science is to study the science and technology to unlock the intelligence that is hidden inside Big Data.
The amounts of data that information systems work with are rapidly increasing. The explosion of data happens at an unprecedented pace, and in our networked world of today the trend is even accelerating. Companies have transactional data with trillions of bytes of information about their customers, suppliers and operations. Sensors in smart devices generate unparalleled amounts of sensor data. Social media sites and mobile phones have allowed billions of individuals globally to create their own enormous trails of data.
The driving force behind this data explosion is the networked world we live in, where information systems, organizations that employ them, people that use them, and processes that they support are connected and integrated, together with the data contained in those systems.

What happens in an internet minute in 2016?

Unlocking the Hidden Intelligence

Data alone is just a commodity; it is Data Science that converts big data into knowledge and insights. Intelligence is hidden in all sorts of data and data systems.
Data in information systems is usually created and generated for specific purposes: it is mostly designed to support operational processes within organizations. As a by-product, however, such event data provide an enormous source of hidden intelligence about what is happening, but organizations can only capitalize on that intelligence if they are able to extract it and transform it into novel services.
Analyzing the data gives organizations the opportunity to gather intelligence on the historic and current performance of their processes and to exploit future chances for performance improvement.
Another rich source of information and insights is data from the Social Web. Analyzing Social Web data provides governments, society and companies with a better understanding of their communities and with knowledge about human behavior and preferences.
Each 3TU institute has its own Data Science program, where local data science expertise is bundled and connected to real-world challenges.

Delft Data Science (DDS) – TU Delft
Scientific director: Prof. Geert-Jan Houben

Data Science Center Eindhoven (DSC/e) – TU/e
Scientific director: Prof. Wil van der Aalst

Data Science Center UTwente (DSC UT) – UT
Scientific director: Dr. Djoerd Hiemstra

More information at: https://www.3tu.nl/nirict/en/Research/data-science/.