What’s exactly ‘big’ about big data? Lecture for the “Universiteit van Nederland” (in Dutch).
by Jeroen Vonk
Within the field of Computer Science, a lot of past and current research is devoted to model checking. Model checking allows researchers to simulate a process or system and exhaustively test it for wanted or unwanted properties. Naturally, the results of these tests are only as dependable as the model's representation of the actual system. The best model, then, would represent the system down to its last atom, allowing for every possible interaction with it. Such a model, of course, becomes extremely large, a situation known as state space explosion. Current research therefore focuses on techniques to cope with this state space explosion.
In this thesis we focus on reducing the size of the models using bisimulation reduction. Bisimulation reduction allows us to identify similar states that can be merged while preserving certain properties of the model. These similar, or redundant, states are identified by comparing them with other states in the model using a bisimulation relation, which relates states showing the same behavior so that they can be merged. A common method to determine the smallest model is partition refinement. In order to use the algorithm on large models it needs to be scalable, so we use MapReduce, a framework for distributed processing that is part of Hadoop. This framework gives us a robust system that automatically recovers from, for example, hardware faults; it also makes our algorithm scalable and easy to run on third-party clusters.
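One common way to realize partition refinement is via state signatures: in every iteration each state receives a new block ID based on its signature, the set of (action, successor block) pairs, until the number of blocks stops changing. The sketch below illustrates that idea in plain Python, with the "map" and "reduce" phases written as ordinary functions rather than Hadoop jobs; the tiny transition system is a made-up example, not the actual thesis code.

```python
# Minimal, in-memory sketch of signature-based partition refinement for
# strong bisimulation. In a MapReduce setting the two phases below would
# run as distributed jobs over the transition relation; here they are
# plain functions. The example transition system is invented.
from collections import defaultdict

def refine(transitions, block):
    """One refinement step: assign each state a new block ID based on its
    signature, i.e. the set of (action, successor block) pairs."""
    # "map" phase: collect (action, successor block) pairs per source state
    signatures = defaultdict(set)
    for src, action, dst in transitions:
        signatures[src].add((action, block[dst]))
    # "reduce" phase: states with identical signatures share a new block ID
    new_block, next_id = {}, {}
    for state in block:
        sig = frozenset(signatures[state])
        if sig not in next_id:
            next_id[sig] = len(next_id)
        new_block[state] = next_id[sig]
    return new_block

# Example: state 0 does an 'a' step, states 1 and 2 only do 'b' steps.
transitions = [(0, 'a', 1), (1, 'b', 2), (2, 'b', 2)]
block = {0: 0, 1: 0, 2: 0}   # initially all states in one block
while True:
    new_block = refine(transitions, block)
    stable = len(set(new_block.values())) == len(set(block.values()))
    block = new_block
    if stable:
        break  # partition no longer splits: blocks are bisimulation classes
print(block)   # {0: 0, 1: 1, 2: 1}: states 1 and 2 are strongly bisimilar
```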
During our experiments we saw that the execution time of a MapReduce job is relatively long: we estimated a startup cost of roughly 30 seconds per job. As a result, the total time to reduce transition systems that need many iterations can be very high. Extreme cases such as vasy 40 60, which takes over 20,000 iterations, therefore could not be benchmarked within an acceptable time frame. In each iteration all of our data passes over the disk, so a slowdown of a factor 10-100 compared to an MPI-based implementation (e.g. LTSmin) is not unreasonable. From our experiments we conclude that the individual iteration times of our algorithm scale linearly up to 10⁸ transitions for strong bisimulation and 10⁷ transitions for branching bisimulation. On larger models the iteration time increases exponentially, so we were not able to benchmark our largest model.
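A back-of-the-envelope calculation with the figures above shows why such extreme cases are out of reach: the per-job startup cost alone already dominates the running time.

```python
# Rough estimate using the figures quoted above: ~30 s of MapReduce job
# startup cost per iteration, and over 20,000 iterations for vasy 40 60.
startup_cost_s = 30
iterations = 20_000
overhead_hours = startup_cost_s * iterations / 3600
print(overhead_hours)  # ~167 hours, i.e. roughly a week of startup overhead alone
```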
Simple models and Big Data trump clever models
Big Data, or the more alliterative Dutch "Grote Gegevens", is a term that has been used since the beginning of this century for data collections that were hard to process with the software of the day: collections many terabytes or petabytes in size. Techniques to process and analyze such enormous collections of data were developed in particular by Google. Google's starting point: put a great many cheap machines together in large data centers, and use smart tools so that application developers and data analysts can use the entire data center for their analyses. The data center is the new computer! Google's smart tools touch many core elements of Computer Science: file systems (Google File System), new programming paradigms (MapReduce), new programming languages (for example Sawzall) and new approaches to data management (BigTable), all developed to make large data collections easily accessible. These techniques are now also available in open source variants. The best known, Hadoop, was developed in large part at Google's competitor Yahoo. At the Universiteit Twente these techniques have been taught in the master's programme Computer Science since 2009. Now that we are able to train on large-scale data collections, the following phenomenon presents itself: simple models trained on big data trump complex models based on less data…
Published in STAtOR 14(3-4), Vereniging voor Statistiek en Operationele Research
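To give a flavor of the MapReduce programming paradigm mentioned above, here is a minimal word-count sketch in plain Python; in a real Hadoop or Google setting the framework runs the map and reduce functions distributed over the cluster, and the example documents below are of course made up.

```python
# Minimal word-count sketch of the MapReduce paradigm: the programmer only
# writes a map and a reduce function; the framework takes care of
# distributing them over the cluster and grouping values by key ("shuffle").
from collections import defaultdict

def map_fn(document):
    # emit (word, 1) for every word in the input document
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # sum all counts emitted for the same word
    return word, sum(counts)

documents = ["big data big clusters", "big models"]
grouped = defaultdict(list)          # stands in for the shuffle phase
for doc in documents:
    for word, count in map_fn(doc):
        grouped[word].append(count)

print(dict(reduce_fn(w, c) for w, c in grouped.items()))
# {'big': 3, 'data': 1, 'clusters': 1, 'models': 1}
```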
Scalable performance for a forensic database application
by Mattijs Ugen
As digital forensic investigations deal with more and more data, the Netherlands Forensic Institute (NFI) foresees scalability issues with the current solution in the near future. Following the global trend towards distributed solutions for 'Big Data' problems, the NFI wants to find a suitable architecture to replace the currently used XIRAF system. Using experimental implementations on top of a selection of distributed data stores, we present query performance timings in three different scaling dimensions: cluster size, working set size and the number of parallel clients. We show that the scaling characteristics for parallel clients follow a linear trend, but prove hard to measure for the other dimensions. A distributed search engine architecture proves to be the best candidate for the NFI, warranting closer investigation in that area for a real-world deployment.
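As a rough illustration of how timings in the parallel-clients dimension can be gathered, the sketch below times a growing number of concurrent clients against a placeholder query; it is not the actual benchmark code, and run_query merely stands in for a real client of one of the evaluated data stores.

```python
# Rough sketch of timing queries under a growing number of parallel clients.
# run_query is a placeholder; a real benchmark would call the data store.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query():
    time.sleep(0.01)  # placeholder for a real query round-trip

def measure(parallel_clients, queries_per_client=100):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=parallel_clients) as pool:
        for _ in range(parallel_clients * queries_per_client):
            pool.submit(run_query)
    return time.perf_counter() - start      # total wall-clock time

for clients in (1, 2, 4, 8):
    print(clients, "clients:", round(measure(clients), 2), "s")
```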
Welcome to the new course Managing Big Data. We will closely follow developments in managing huge amounts of data on large clusters of commodity machines, initiated by Google and followed by many other web companies such as Yahoo, Amazon, AOL, Facebook, Hyves, Spotify, Twitter, etc. Big data gives rise to a redesign of many core computer science concepts: we will discuss file systems (Google FS), programming paradigms (MapReduce), programming and query languages (for instance Sawzall and Pig Latin), and 'noSQL' database paradigms (for instance BigTable and Dynamo) for managing big data. The first lecture is next Friday, 16 November, at 10.45 h in RA 2502.
More information on Blackboard (access restricted; sorry, our university does not like me to share courses 🙁).