by Marco Schultewolter
Software providers often ask users to enter personal data before granting them the right to use their software. These companies want user profiles to be as accurate as possible, but users sometimes enter incorrect information. This thesis researches and discusses approaches to automatically verify this information using third-party web resources.
To this end, a series of experiments is carried out. One experiment compares different similarity measures, each combined with different search approaches, on a German phone book directory. Another experiment uses a search engine without a specific predefined data source; for this purpose, ways of finding persons in search engines and of extracting address information from unknown websites are compared.
It is shown that automatic verification can be done to some extent. The verification of name and address data using external web resources, with Jaro-Winkler as the similarity measure, can support the decision, but it is not yet solid enough to rely on exclusively. Extracting address information from unknown pages is very reliable when using a sophisticated regular expression. Persons are best searched for on the internet using just the full name, without any additions.
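For readers unfamiliar with the Jaro-Winkler measure mentioned above, the following is a minimal sketch of the standard textbook definition in Python. It is not the thesis implementation; the function names and the default values of `prefix_scale` (0.1) and `max_prefix` (4) are the commonly used conventions and are assumptions here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


# Classic textbook pair: two letters are transposed.
score = jaro_winkler("MARTHA", "MARHTA")
```

Because of the prefix boost, the measure is forgiving of typos near the end of a name while still penalizing differences at the start, which is one plausible reason it suits name and address matching.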
21-22 January 2016
University of Twente
If you're interested in social media analysis and/or computational social science, there will be interesting guest speakers, including speakers from UCLA, TNO, TU Delft, Greenpeace, Sanquin, and Twitter.
Check out the Jupyter (IPython) Notebook exercises made for the module Web Science. They closely follow the exercises from Chapters 13 and 14 of the wonderful Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley and Jon Kleinberg. Download the notebooks here:
Update (February 2016). The notebooks with answers are now available below:
Estimating Creditworthiness using Uncertain Online Data
by Maurice Bolhuis
The rules for credit lenders have become stricter since the financial crisis of 2007-2008. As a consequence, it has become more difficult for companies to obtain a loan. Many people and companies leave a trail of information about themselves on the Internet. Searching and extracting this information is accompanied by uncertainty. In this research, we study whether this uncertain online information can be used as an alternative or extra indicator for estimating a company’s creditworthiness and how accounting for information uncertainty impacts the prediction performance.
A data set consisting of 3579 corporate ratings has been constructed using the data of an external data provider. Based on the results of a survey, a literature study and information availability tests, LinkedIn accounts of company owners, corporate Twitter accounts and corporate Facebook accounts were chosen as information sources for extracting indicators. In total, the Twitter and Facebook accounts of 387 companies and 436 corresponding LinkedIn owner accounts of this data set were manually searched. Information was harvested from these sources and several indicators were derived from the harvested information.
Two experiments were performed with this data. In the first experiment, Naive Bayes, J48, Random Forest and Support Vector Machine classifiers were trained and tested using solely these Internet features. A comparison of their accuracy to the 31% accuracy of the ZeroR classifier, which always predicts the most frequent target class, showed that none of the models performed statistically significantly better. In a second experiment, it was tested whether combining Internet features with financial data increases the accuracy. A financial data mining model was created that approximates the rating model of the ratings in our data set and that uses the same financial data as the rating model. The two best performing financial models were built using the Random Forest and J48 classifiers, with an accuracy of 68% and 63% respectively. Adding Internet features to these models gave mixed results: a significant decrease and an insignificant increase in accuracy, respectively.
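The ZeroR baseline used in the first experiment can be illustrated with a few lines of Python. This is only a sketch of the general idea; the helper names are hypothetical, and the rating labels are made-up toy values, not data from the thesis.

```python
from collections import Counter
from typing import Hashable, Sequence


def zero_r_fit(train_labels: Sequence[Hashable]) -> Hashable:
    """Return the most frequent class in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions: Sequence[Hashable], gold: Sequence[Hashable]) -> float:
    """Fraction of predictions that equal the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy labels for illustration only (hypothetical rating classes).
train = ["B", "A", "B", "C", "B", "A"]
test = ["B", "A", "C", "B"]

majority = zero_r_fit(train)          # ZeroR ignores all input features
predictions = [majority] * len(test)  # and predicts the same class every time
baseline_accuracy = accuracy(predictions, test)
```

Any classifier that uses the features is only informative if it beats this baseline by a statistically significant margin, which is the comparison reported above.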
An experimental setup for testing how incorporating uncertainty affects the prediction accuracy of our model is explained. As part of this setup, a search system is described to find candidate results of online information related to a subject and to classify the degree of uncertainty of this online information. It is illustrated how uncertainty can be incorporated into the data mining process.
We are proud to announce the 12th Seminar on Searching and Ranking, with guest presentations by Ingo Frommholz from the University of Bedfordshire, UK, and Tom Heskes from Radboud University Nijmegen, the Netherlands.
More information at: SSR 12.
The Influence of Prosocial Norms and Online Network Structure on Prosocial Behavior: An Analysis of Movember’s Twitter Campaign in 24 Countries
by Tijs van den Broek, Ariana Need, Michel Ehrenhard, Anna Priante and Djoerd Hiemstra
Sociological research points to norms and social networks as antecedents of prosocial behavior. To date, the literature remains undecided on how these factors jointly influence prosocial behavior. Furthermore, the use of social media by campaign organizations may change the need for formal networks to organize large-scale collective action. Hence, in this paper we examine the interplay of prosocial norms and the structure of online social networks on offline prosocial behavior. For this purpose we use donation data from the global Movember campaign, messages about the Movember campaign on the online social networking site Twitter, and data from the World Giving Index. A multi-level analysis of Movember’s campaigns in 24 countries finds support for the logic of connective action: larger and more decentralized networks raise more donations. Furthermore, we find that the effect of prosocial norms on donations is decreased by larger and denser campaign networks.
To be presented at Social media, Activism, and Organizations 2015 (SMAO) on 6 November in London, UK.
Recommendations using DBpedia: How your Facebook profile can be used to find your next greeting card
by Anne van de Venis
Recommender systems (RS) are systems that provide suggestions that users may find interesting. In this thesis we present our Interest-Based Recommender System (IBRS), which can recommend tagged item sets from any domain. This RS is validated with item sets from two different domains, namely postcards and holiday homes. While postcards and holiday homes are very different items with different characteristics, IBRS uses the same recommender engine to create recommendations for both. IBRS solves several problems that are present in classic RSs, such as the cold-start problem and language dependence. The cold-start problem for new users is solved by using Facebook likes to create a user profile. IBRS uses information in DBpedia to create recommendations in a tag-based item set for multiple domains, independent of the language. Using both external knowledge sources and user content makes our system a hybrid of a knowledge-based and a content-based RS. We validated our system through an online evaluation system in two evaluation rounds with test user groups of approximately 71 and 44 people. The main contributions of this thesis are:
- a literature study of existing recommendation approaches;
- a language-independent approach for mapping tags and social media resources onto DBpedia resources;
- a domain-independent algorithm for detecting related concepts in the DBpedia graph;
- a recommendation approach based on both Facebook and DBpedia;
- a validation of our recommendation approach.
Optimizing Travel Destinations Based on User Preferences
by Julia Kiseleva (TU Eindhoven), Melanie Müller (Booking.com), Lucas Bernardi (Booking.com), Chad Davis (Booking.com), Ivan Kovacek (Booking.com), Mats Stafseng Einarsen (Booking.com), Jaap Kamps (University of Amsterdam), Alexander Tuzhilin (New York University), Djoerd Hiemstra
Recommendation based on user preferences is a common task for e-commerce websites. New recommendation algorithms are often evaluated by offline comparison to baseline algorithms such as recommending random or the most popular items. Here, we investigate how these algorithms themselves perform and compare to the operational production system in large-scale online experiments in a real-world application. Specifically, we focus on recommending travel destinations at Booking.com, a major online travel site, to users searching for their preferred vacation activities. To build ranking models we use multi-criteria rating data provided by previous users after their stay at a destination. We implement three methods and compare them to the current baseline at Booking.com: random, most popular, and Naive Bayes. Our general conclusion is that, in an online A/B test with live users, our Naive Bayes-based ranker increased user engagement significantly over the current online system.
To be presented at SIGIR 2015, the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, on 12 August in Santiago de Chile.
UT-DB: An Experimental Study on Sentiment Analysis in Twitter
Zhemin Zhu, Djoerd Hiemstra, Peter Apers, and Andreas Wombacher
This paper describes our system for participating in SemEval 2013 Task 2-B: Sentiment Analysis in Twitter. Given a message, our system classifies whether the message expresses positive, negative or neutral sentiment. It uses a co-occurrence rate model. The training data are constrained to the data provided by the task organizers (no other tweet data are used). We consider 9 types of features and use a subset of them in our submitted system. To see the contribution of each type of feature, we conduct an experimental study in which we leave out one type of feature at a time. Results suggest that unigrams are the most important features, that bigrams and POS tags do not seem helpful, and that stopwords should be retained to achieve the best results. The overall results of our system are promising given the constrained features and data we use.
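The leave-one-feature-type-out study described above can be sketched generically in Python. This is an illustration of the ablation procedure only, not the authors' code: `leave_one_out_ablation`, the `evaluate` callback, and the toy feature names and scores are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def leave_one_out_ablation(
    feature_types: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """For each feature type, return the score lost when it is removed.

    `evaluate` is assumed to train and score a model using only the
    given feature types; a positive value means the type was helping.
    """
    all_types = frozenset(feature_types)
    full_score = evaluate(all_types)
    return {
        name: full_score - evaluate(all_types - {name})
        for name in sorted(all_types)
    }


# Toy stand-in for a real train-and-score routine: each hypothetical
# feature type contributes a fixed, made-up amount to the score.
TOY_CONTRIBUTION = {"type_a": 3, "type_b": 1, "type_c": 0}


def toy_evaluate(active: FrozenSet[str]) -> float:
    return sum(TOY_CONTRIBUTION[name] for name in active)


drops = leave_one_out_ablation(TOY_CONTRIBUTION, toy_evaluate)
```

In a real study, `evaluate` would retrain the classifier on the reduced feature set each time, so the cost grows linearly with the number of feature types.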