Distributed data processing using MapReduce

Distributed data processing using MapReduce is a new course that teaches how to carry out large-scale distributed data analysis using Google's MapReduce as the programming abstraction. MapReduce is a programming abstraction that is inspired by the functions 'map' and 'reduce' as found in functional programming language such as Lisp. It was developed at Google as a mechanism to allow large-scale distributed processing of data on data centers consisting of thousands of low-cost machines. MapReduce allows programmers to distribute their programs over many machines without the need to worry about system failures, threads, locks, semaphores, and other concepts from concurrent and distributed programming. Students will learn to specify algorithms using map and reduce steps and to implement these algorithms using Hadoop, an open source implementation of Google's file system and MapReduce. The course will introduce recent attempts to develop high-level languages for simplified relational data processing on top of Hadoop, such as Yahoo's Pig Latin and Microsoft's DryadLINQ.

The course consists of lectures and practical assignments. Students will solve lab exercises on a large cluster of machines in order to get hands-on experience and solve real large-scale problems. The lab exercises will be done on the University of Twente PRISMA-2 computer, a data center consisting of 16 dual core systems sponsored by Yahoo Research. Examples of lab exercises are: counting bigrams in large web crawls, inverted index construction, and the computation of Google's PageRank. After successful completion of the course, the student is able to:

  • Disect complex problems in algorithms that use map and reduce steps,
  • Specify these algorithms in a functional language such as Haskell,
  • Implement these algorithms using the Hadoop framework,
  • Specify simplified relational queries using Pig Latin.

More information at Blackboard.