Cross-Domain Authorship Attribution as a Tool for Digital Investigations
by Christel Geurts
On the dark web, sites promoting illegal content are abundant and new sites are constantly created. At the same time, law enforcement is working hard to take these sites down and track down the persons involved. Often, after a site is taken down, users change their name and move to a different site. But what if law enforcement could track users across sites? Each distinct site or source of information is called a domain. As the domain changes, the context of a message often changes too, making it challenging to track users simply by the words they use. The aim of this thesis is to develop a system that can link written text of authors in a cross-domain setting. The system was tested on a blog corpus and verified on police data. Tests show that multinomial logistic regression and Support Vector Machines with a linear kernel perform well. Character 3-grams work well as features, and combining multiple feature sets increases performance. On the blog corpus, Logistic Regression models with a combined feature set performed best (accuracy = 0.717, MRR = 0.7785, 1000 authors). On the police data, the Logistic Regression model had an accuracy of 0.612 and an MRR of 0.6883 for 521 authors.
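To make the feature and metric names in the abstract above concrete, here is a minimal sketch of character 3-gram extraction and of the mean reciprocal rank (MRR) computed over ranked candidate authors. The function names and the toy rankings are illustrative assumptions, not the thesis implementation, and the classifier itself is left out.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (n=3 gives character 3-grams)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def reciprocal_rank(ranked_authors, true_author):
    """Return 1/rank of the true author, or 0.0 if the author is not ranked."""
    for position, author in enumerate(ranked_authors, start=1):
        if author == true_author:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings, true_authors):
    """Average the reciprocal rank over all test documents."""
    total = sum(reciprocal_rank(r, t) for r, t in zip(rankings, true_authors))
    return total / len(true_authors)


def top1_accuracy(rankings, true_authors):
    """Fraction of documents whose top-ranked candidate is the true author."""
    hits = sum(1 for r, t in zip(rankings, true_authors) if r and r[0] == t)
    return hits / len(true_authors)


# Hypothetical toy data: three documents, all written by author "a".
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truths = ["a", "a", "a"]
features = char_ngrams("banana")
mrr = mean_reciprocal_rank(rankings, truths)
acc = top1_accuracy(rankings, truths)
```

MRR rewards a model for ranking the true author near the top even when the first guess is wrong, which is why it is never lower than top-1 accuracy in the reported results.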
by Slavica Zivanovic
There is an ongoing discussion about the applicability of social media data in scientific research. Moreover, little is known about the feasibility of using these data to capture Quality of Life (QoL). This study explores the use of social media in QoL research by capturing and analysing people’s perceptions about their QoL using Twitter messages. The methodology is based on a mixed method approach, combining manual coding of the messages, automated classification, and spatial analysis. The city of Bristol is used as a case study, with a dataset containing 1,374,706 geotagged Tweets sent within the city boundaries in 2013. Based on the manual coding results, the health, transport, and environment domains were selected for further analysis. Results show differences between Bristol wards in the number and type of QoL perceptions in every domain, the spatial distribution of positive and negative perceptions, and differences between the domains. Furthermore, results from this study are compared to the official QoL survey results from Bristol, statistically and spatially. Overall, three main conclusions are underlined. First, Twitter data can be used to evaluate QoL. Second, based on people’s opinions, there is a difference in QoL between Bristol neighbourhoods. And, third, Twitter messages can be used to complement QoL surveys but not as a proxy. The main contribution of this study is in recognising the potential Twitter data have in QoL research. This potential lies in producing additional knowledge about QoL that can be placed in a planning context and effectively used to improve the decision-making process and enhance the quality of life of residents.
by Marco Schultewolter
Often, software providers ask users to enter personal data in order to grant them the right to use their software. These companies want user profiles to be as accurate as possible, but users sometimes enter incorrect information. This thesis researches and discusses approaches to automatically verify this information using third-party web resources.
To this end, a series of experiments is carried out. One experiment compares different similarity measures, each combined with different search approaches, in the context of a German phone book directory. Another experiment uses a search engine without a specific predefined data source. For this, ways of finding persons in search engines and of extracting address information from unknown websites are compared.
It is shown that automatic verification can be done to some extent. The verification of name and address data using external web resources can support the decision with Jaro-Winkler as similarity measure, but it is still not solid enough to rely on alone. Extracting address information from unknown pages is very reliable when using a sophisticated regular expression. Finding persons on the internet should be done by using just the full name without any additions.
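For readers unfamiliar with the Jaro-Winkler measure mentioned above, the following is a minimal sketch of the standard textbook definition. The prefix scale of 0.1 and the maximum prefix length of 4 are the conventional defaults and are assumed here; the thesis may use a different implementation or parameters.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1] between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


# Classic textbook pair: two adjacent letters swapped.
example = jaro_winkler("MARTHA", "MARHTA")  # approximately 0.961
```

Because the measure favours strings that agree at the start, it suits names, where typos tend to occur later in the word. In a verification setting, a score above some threshold would count as supporting evidence rather than proof, in line with the conclusion above.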
21-22 January 2016
University of Twente
If you're interested in social media analysis and/or computational social science, there will be interesting guest speakers, including speakers from UCLA, TNO, TU Delft, Greenpeace, Sanquin, and Twitter.
Check out the Jupyter IPython Notebook Exercises made for the module Web Science. The exercises closely follow the exercises from Chapter 13 and 14 of the wonderful Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley and Jon Kleinberg. Download the notebooks here:
Update (February 2016). The notebooks with answers are now available below:
Estimating Creditworthiness using Uncertain Online Data
by Maurice Bolhuis
The rules for credit lenders have become stricter since the financial crisis of 2007-2008. As a consequence, it has become more difficult for companies to obtain a loan. Many people and companies leave a trail of information about themselves on the Internet. Searching and extracting this information is accompanied by uncertainty. In this research, we study whether this uncertain online information can be used as an alternative or extra indicator for estimating a company’s creditworthiness and how accounting for information uncertainty impacts the prediction performance.
A data set consisting of 3579 corporate ratings has been constructed using the data of an external data provider. Based on the results of a survey, a literature study and information availability tests, LinkedIn accounts of company owners, corporate Twitter accounts and corporate Facebook accounts were chosen as information sources for extracting indicators. In total, the Twitter and Facebook accounts of 387 companies and 436 corresponding LinkedIn owner accounts of this data set were manually searched. Information was harvested from these sources and several indicators were derived from the harvested information.
Two experiments were performed with this data. In the first experiment, Naive Bayes, J48, Random Forest and Support Vector Machine classifiers were trained and tested using solely these Internet features. A comparison of their accuracy to the 31% accuracy of the ZeroR classifier, which as a rule always predicts the most frequent target class, showed that none of the models performed statistically better. In a second experiment, it was tested whether combining Internet features with financial data increases the accuracy. A financial data mining model was created that approximates the rating model of the ratings in our data set and that uses the same financial data as the rating model. The two best performing financial models were built using the Random Forest and J48 classifiers, with an accuracy of 68% and 63% respectively. Adding Internet features to these models gave mixed results, with a significant decrease and an insignificant increase respectively.
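The ZeroR baseline used for comparison above is simple enough to sketch in a few lines: it ignores all features and always predicts the most frequent class seen in training. The function names and the rating labels below are made up for illustration and are not the data from the study.

```python
from collections import Counter


def zero_r_fit(train_labels):
    """Return the most frequent class in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, labels):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Hypothetical rating classes, not taken from the thesis data set.
train = ["B", "A", "B", "C", "B", "A"]
test = ["A", "B", "B", "C", "B"]

majority = zero_r_fit(train)
predictions = [majority] * len(test)  # same answer for every test item
baseline_accuracy = accuracy(predictions, test)
```

Any trained model has to beat this number by a statistically meaningful margin before its features can be said to carry information about the target.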
An experimental setup for testing how incorporating uncertainty affects the prediction accuracy of our model is explained. As part of this setup, a search system is described to find candidate results of online information related to a subject and to classify the degree of uncertainty of this online information. It is illustrated how uncertainty can be incorporated into the data mining process.
We are proud to announce the 12th Seminar on Searching and Ranking, with guest presentations by Ingo Frommholz from the University of Bedfordshire, UK, and Tom Heskes from Radboud University Nijmegen, the Netherlands.
More information at: SSR 12.
The Influence of Prosocial Norms and Online Network Structure on Prosocial Behavior: An Analysis of Movember’s Twitter Campaign in 24 Countries
by Tijs van den Broek, Ariana Need, Michel Ehrenhard, Anna Priante and Djoerd Hiemstra
Sociological research points at norms and social networks as antecedents of prosocial behavior. To date, the literature remains undecided on how these factors jointly influence prosocial behavior. Furthermore, the use of social media by campaign organizations may change the need for formal networks to organize large-scale collective action. Hence, in this paper we examine the interplay of prosocial norms and the structure of online social networks on offline prosocial behavior. For this purpose we use donation data from the global Movember campaign, messages about the Movember campaign on the online social networking site Twitter, and data from the World Giving Index. A multi-level analysis of Movember’s campaigns in 24 countries finds support for the logic of connective action: larger and more decentralized networks raise more donations. Furthermore, we find that the effect of prosocial norms on donations is decreased by larger and denser campaign networks.
To be presented at Social media, Activism, and Organizations 2015 (SMAO) on 6 November in London, UK.
Recommendations using DBpedia: How your Facebook profile can be used to find your next greeting card
by Anne van de Venis
Recommender systems (RS) are systems that provide suggestions that users may find interesting. In this thesis we present our Interest-Based Recommender System (IBRS) that can recommend tagged item sets from any domain. This RS is validated with item sets from two different domains, namely postcards and holiday homes. While postcards and holiday homes are very different items, with different characteristics, IBRS uses the same recommender engine to create recommendations. IBRS addresses several problems that are present in classic RSs, such as the cold-start problem and language dependence. The cold-start problem for new users is solved by using Facebook likes to create a user profile. IBRS uses information in DBpedia to create recommendations in a tag-based item set for multiple domains, independent of the language. Using both external knowledge sources and user content makes our system a hybrid of a knowledge-based and content-based RS. We validated our system through an online evaluation system in two evaluation rounds with test user groups of approximately 71 and 44 people. The main contributions in this thesis are:
- a literature study of existing recommendation approaches;
- a language-independent approach for mapping tags and social media resources onto DBpedia resources;
- a domain-independent algorithm for detecting related concepts in the DBpedia graph;
- a recommendation approach based on both Facebook and DBpedia;
- a validation of our recommendation approach.
Optimizing Travel Destinations Based on User Preferences
by Julia Kiseleva (TU Eindhoven), Melanie Müller (Booking.com), Lucas Bernardi (Booking.com), Chad Davis (Booking.com), Ivan Kovacek (Booking.com), Mats Stafseng Einarsen (Booking.com), Jaap Kamps (University of Amsterdam), Alexander Tuzhilin (New York University), Djoerd Hiemstra
Recommendation based on user preferences is a common task for e-commerce websites. New recommendation algorithms are often evaluated by offline comparison to baseline algorithms such as recommending random or the most popular items. Here, we investigate how these algorithms themselves perform and compare to the operational production system in large-scale online experiments in a real-world application. Specifically, we focus on recommending travel destinations at Booking.com, a major online travel site, to users searching for their preferred vacation activities. To build ranking models we use multi-criteria rating data provided by previous users after their stay at a destination. We implement three methods and compare them to the current baseline at Booking.com: random, most popular, and Naive Bayes. Our general conclusion is that, in an online A/B test with live users, our Naive Bayes based ranker increased user engagement significantly over the current online system.
To be presented at SIGIR 2015, the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, on 12 August in Santiago de Chile.