Composing a more complete and relevant Twitter dataset
by Han van der Veen
Social data is widely used by researchers. Facebook, Twitter and other social networks produce huge amounts of social data, which can be used to analyze human behavior. Social datasets are typically collected by hashtag; however, not all relevant tweets include the hashtag, and a better overview can be constructed with more data. This research focuses on creating a more complete and relevant dataset by using additional keywords to find more relevant tweets and a filtering mechanism to remove the irrelevant ones. Three methods for selecting additional keywords are proposed and evaluated: one based on word frequency, one on the probability of a word occurring in a dataset, and one on estimates of tweet volume. Two classifiers for filtering tweets are compared: a Naive Bayes classifier and a Support Vector Machine classifier. Our method increases the size of the dataset by 105%. Average precision dropped from 95% when using only a hashtag to 76% for the resulting dataset. These evaluations were carried out on two TV shows and two sports events. A tool was developed that automatically executes all parts of the method. It takes the hashtag of a specific event as input and outputs a more complete and relevant dataset than the original hashtag alone would yield. This is useful for social researchers, and for any other researchers who use tweets as their data.
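As a rough illustration of the word-frequency idea mentioned above, the following minimal Python sketch picks candidate keywords from tweets already collected by a hashtag. It is not the tool described in this work: the function names, the tiny stopword list, the choice to count each word once per tweet, and the sample tweets are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real run would need a fuller one.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "on", "for", "rt", "what"}


def tokenize(text):
    """Lowercase a tweet and split it into word and hashtag tokens."""
    return re.findall(r"#?\w+", text.lower())


def frequent_keywords(tweets, seed_hashtag, top_k=3):
    """Return the top_k most frequent tokens, excluding stopwords and the seed hashtag."""
    seed = seed_hashtag.lower()
    counts = Counter()
    for tweet in tweets:
        # Count each token once per tweet so one repetitive tweet cannot dominate.
        tokens = set(tokenize(tweet))
        counts.update(t for t in tokens if t not in STOPWORDS and t != seed)
    return [word for word, _ in counts.most_common(top_k)]


# Made-up sample tweets, for illustration only.
sample = [
    "Watching the final tonight #worldcup",
    "What a goal in the final! #worldcup",
    "Goal!!! #worldcup",
]
print(frequent_keywords(sample, "#worldcup", top_k=2))
```

The returned words would then serve as extra search terms, and the tweets they bring in would still need to pass through a relevance filter such as the classifiers described above.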