Tips and additional information for Assignment 3

The deadline for Assignment 3 is Friday 4 December at 10:45 (start of the lecture). Some tips for Assignment 3:

  • To run the example code for regular expression matching in Haskell, you need to import Text.Regex (from the regex-compat package) and Data.Maybe.

  • Assignment 3.4: compute a hash value over the complete content of each web page. Two duplicates always receive exactly the same hash value, but because of collisions two different pages might also receive the same hash value. So after computing the hashes you still have to do a final check: compare the actual content of pages that share a hash value, and remove only the true duplicates.
  • As an example of the result of the sample stage of Assignment 3.5, consider sorting people by their height on three machines. The sample stage would set boundaries on the values that divide the data into three approximately equal parts, for instance:
    • values between 0 and 1.75 m: part 1
    • values between 1.75 m and 1.80 m: part 2
    • values between 1.80 m and infinity: part 3
(You might get these boundaries if the sampling stage reveals that about 1/3 of the people are shorter than 1.75 m, 1/3 are between 1.75 m and 1.80 m tall, and 1/3 are taller than 1.80 m.)
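As an illustration of how such boundaries can be used in Assignment 3.5, here is a minimal Haskell sketch. It assumes the sample stage has already produced an ascending list of boundaries; the names partIndex and partitionAndSort are made up for this example and are not part of the assignment code. A value equal to a boundary is placed in the higher part, which is an arbitrary choice.

```haskell
import Data.List (sort)

-- | 0-based index of the part a value belongs to, given ascending boundaries.
--   A value equal to a boundary goes to the higher part (arbitrary choice).
partIndex :: Ord a => [a] -> a -> Int
partIndex boundaries x = length (takeWhile (<= x) boundaries)

-- | Split the values into (number of boundaries + 1) parts and sort each
--   part separately, as each machine would do for its own part.
partitionAndSort :: Ord a => [a] -> [a] -> [[a]]
partitionAndSort boundaries xs =
  [ sort [x | x <- xs, partIndex boundaries x == i]
  | i <- [0 .. length boundaries]
  ]

-- Hypothetical sample data: heights in metres.
exampleHeights :: [Double]
exampleHeights = [1.82, 1.68, 1.77, 1.91, 1.74, 1.79]
```

With the boundaries [1.75, 1.80], partitionAndSort puts 1.68 and 1.74 in part 1, 1.77 and 1.79 in part 2, and 1.82 and 1.91 in part 3. Because every value in part 1 is smaller than every value in part 2, and so on, concatenating the sorted parts gives the fully sorted list.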

Note that an actual implementation in Hadoop needs a user-defined "partitioner" to send each value to the right part, but for the Haskell assignment this is unimportant.
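Returning to the hash tip for Assignment 3.4, the two steps (group by hash, then do a final check on the content) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the Page type, the toy simpleHash function and the name removeDuplicatesWith do not come from the assignment, and you may well want a better hash function.

```haskell
import Data.Char (ord)
import Data.List (foldl', nubBy)
import qualified Data.Map as Map

-- | Hypothetical page representation: (URL, content).
type Page = (String, String)

-- | A toy hash function (assumption); different contents can collide.
simpleHash :: String -> Int
simpleHash = foldl' step 7
  where
    step h c = (h * 31 + ord c) `mod` 1000003

-- | Group pages by the hash of their content, keeping the input order
--   inside each group.
bucketByHash :: (String -> Int) -> [Page] -> Map.Map Int [Page]
bucketByHash hash pages =
  Map.fromListWith (flip (++)) [(hash content, [p]) | p@(_, content) <- pages]

-- | Remove duplicates: only pages with the same hash are compared, and the
--   final check on the content itself keeps colliding but different pages.
removeDuplicatesWith :: (String -> Int) -> [Page] -> [Page]
removeDuplicatesWith hash pages =
  concatMap (nubBy sameContent) (Map.elems (bucketByHash hash pages))
  where
    sameContent (_, a) (_, b) = a == b
```

Calling removeDuplicatesWith simpleHash on a list of pages keeps the first page of every group of identical contents; the result is ordered by hash value, not by input order. Passing a deliberately bad hash such as const 0 puts all pages in one group, which shows why the final content check is needed: different pages still survive.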

Finally, for the next lecture, please think about which problem you want to solve with Hadoop for Assignment 4.

More info on Blackboard.