by Wim van der Zijden
A good practice in business is to focus on key activities. For some companies this may be branding, while others may focus on areas such as consultancy, production or distribution. Focusing on key activities means outsourcing as many other activities as possible: these other activities merely distract from the company's main goals, and the company will not be able to excel in them.
Many companies need reliable software to persistently process live data transactions and enable reporting on this data. To fulfil this need, they often have large in-house IT departments. Those departments are costly and distract from the company's main goals. The emergence of cloud computing should make this unnecessary: all a company needs is an internet connection and a service contract with an external provider.
However, most businesses need highly customizable software, because each company has slightly different business processes, even within the same industry. So even if they outsource their IT needs, they will still have to pay expensive developers and business analysts to customize an existing application.
These issues are addressed by Multi-Tenant Customizable (MTC) applications. We define such an application as follows:
A single software solution that can be used by multiple organizations at the same time and which is highly customizable for each organization and user within that organization, by domain experts without a technical background.
A key challenge in designing such a system is to develop a proper persistent data storage, because mainstream databases are optimized for single-tenant usage. To this end, this Master's thesis consists of two papers: the first proposes MTCB, an MTC-DB benchmark. The benchmark allows for objective comparison and evaluation of MTC-DB implementations, and provides a framework for the definition of MTC-DB. The second paper describes a number of MTC-DB implementations and uses the benchmark to evaluate them.
by Djoerd Hiemstra and Robin Aly
SIKS/CBS DataCamp participants can download the answers for the Jupyter Scala/Spark notebook exercises below.
The Data Science Platform Netherlands (DSPN) is the national platform for ICT research within the Data Science domain. Data Science is the collection and analysis of so-called ‘Big Data’ according to academic methodology. DSPN unites all Dutch academic research institutions where Data Science is carried out from an ICT perspective, specifically the computer science or applied mathematics perspectives. The objectives of DSPN are to:
- Highlight the importance of ICT research in Big Data and Data Science, especially in national discussions about research and education.
- Exchange and disseminate information about Data Science research and education.
- Build and maintain a network of ICT researchers active in the field of Data Science.
DSPN is launched as part of the ICT Research Platform Netherlands (IPN) to give a voice to the Data Science initiatives of the Dutch ICT research organisations. For more information, see the website at: http://www.datascienceplatform.org/.
by Mostafa Dehghani, Hosein Azarbonyad, Jaap Kamps, Djoerd Hiemstra, and Maarten Marx
Users tend to articulate their complex information needs in only a few keywords, making underspecified statements of request the main bottleneck for retrieval effectiveness. Taking advantage of feedback information is one of the best ways to enrich the query representation, but it can also lead to loss of query focus and harm performance by overfitting to accidental features of the particular observed feedback documents, especially when the initial query retrieves only little relevant information. Inspired by the early work of Hans Peter Luhn, we propose significant words language models of feedback documents that capture all, and only, the significant shared terms from feedback documents. We adjust the weights of common terms that are already well explained by the document collection, as well as the weights of rare terms that are only explained by specific feedback documents, which eventually leaves only the significant terms in the feedback model.
Our main contributions are the following. First, we present significant words language models as effective models that capture the essential terms and their probabilities. Second, we apply the resulting models to the relevance feedback task and observe better performance than state-of-the-art methods. Third, we find that the estimation method is remarkably robust, making the models insensitive to noisy non-relevant terms in feedback documents. Our general observation is that significant words language models capture relevance more accurately by excluding both general terms and feedback-document-specific terms.
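The idea of keeping only the significant shared terms can be sketched in code. This is only a toy illustration, not the estimation procedure from the paper: the function name, thresholds, and filtering rules below are invented for exposition, to show how terms well explained by the collection and terms specific to a single feedback document can both be filtered out before renormalizing.

```python
from collections import Counter

def significant_words(feedback_docs, collection_probs,
                      general_ratio=2.0, min_spread=0.5):
    """Toy sketch of a significant-words feedback model: keep terms
    shared across the feedback documents, but drop terms already well
    explained by the collection and terms specific to one document."""
    totals = Counter()
    for doc in feedback_docs:
        totals.update(doc)
    n = sum(totals.values())
    model = {}
    for term, count in totals.items():
        p_feedback = count / n
        # drop general terms: already well explained by the collection
        if p_feedback < general_ratio * collection_probs.get(term, 1e-9):
            continue
        # drop document-specific terms: seen in too few feedback documents
        doc_freq = sum(1 for doc in feedback_docs if term in doc)
        if doc_freq / len(feedback_docs) < min_spread:
            continue
        model[term] = p_feedback
    # renormalize the surviving significant terms into a distribution
    z = sum(model.values())
    return {t: p / z for t, p in model.items()}

# Hypothetical example: "the" is general, "kiwi" is document-specific,
# so only "apple" survives as a significant term.
feedback_docs = [Counter({"the": 5, "apple": 3}),
                 Counter({"the": 4, "apple": 2, "kiwi": 1}),
                 Counter({"the": 3, "apple": 4})]
collection_probs = {"the": 0.4, "apple": 0.001, "kiwi": 0.001}
model = significant_words(feedback_docs, collection_probs)
```

In this made-up example, the common term "the" is filtered out because the collection model already explains it, and the rare term "kiwi" is filtered out because it occurs in only one feedback document.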
To be presented at the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016) on October 24-28, 2016 in Indianapolis, United States.
The main objective of the NIRICT research in Data Science is to study the science and technology to unlock the intelligence that is hidden inside Big Data.
The amounts of data that information systems work with are rapidly increasing. The explosion of data happens at an unprecedented pace, and in today's networked world the trend is even accelerating. Companies have transactional data with trillions of bytes of information about their customers, suppliers and operations. Sensors in smart devices generate unparalleled amounts of sensor data. Social media sites and mobile phones have allowed billions of individuals globally to create their own enormous trails of data.
The driving force behind this data explosion is the networked world we live in, where information systems, organizations that employ them, people that use them, and processes that they support are connected and integrated, together with the data contained in those systems.
Unlocking the Hidden Intelligence
Data alone is just a commodity; it is Data Science that converts big data into knowledge and insights. Intelligence is hidden in all sorts of data and data systems.
Data in information systems is usually created and generated for specific purposes: it is mostly designed to support operational processes within organizations. However, as a by-product, such event data provide an enormous source of hidden intelligence about what is happening, but organizations can only capitalize on that intelligence if they are able to extract it and transform the intelligence into novel services.
Analyzing the data provides opportunities for organizations to gather intelligence, capitalize on the historic and current performance of their processes, and exploit future chances for performance improvement.
Another rich source of information and insights is data from the Social Web. Analyzing Social Web Data provides governments, society and companies with better understanding of their community and knowledge about human behavior and preferences.
Each 3TU institute has its own Data Science program, where local data science expertise is bundled and connected to real-world challenges.
Delft Data Science (DDS) – TU Delft
Scientific director: Prof. Geert-Jan Houben
Data Science Center Eindhoven (DSC/e) – TU/e
Scientific director: Prof. Wil van der Aalst
Data Science Center UTwente (DSC UT) – UT
Scientific director: Dr. Djoerd Hiemstra
More information at: https://www.3tu.nl/nirict/en/Research/data-science/.
On 20 April, we organize a Data Science Day in the DesignLab. Invited speakers at the Data Science Colloquium are Piet Daas, methodologist and big data research coordinator of the CBS (Centraal Bureau voor de Statistiek), who will talk about big data from Twitter and Facebook as a data source for official statistics; Rolf de By and Raul Zurita Milla, professors of ITC Geo-Information Science and Earth Observation, who will talk about remote sensing techniques using satellites and drones to help economies in poor areas of the world, a prestigious project funded by the Bill and Melinda Gates Foundation; and Jan Willem Tulp, creator of interactive data visualisations for magazines like Scientific American and Popular Science, as well as for companies, for instance the Tax Free Retail Analysis Tool for Schiphol Amsterdam Airport.
The Data Science colloquia are kindly sponsored by the CTIT and the Netherlands Research School for Information and Knowledge Systems (SIKS), and are part of the SIKS educational program.
Scientific and economic progress is increasingly powered by our capabilities to explore big datasets. Data is the driving force behind the successful innovation of Internet companies like Google, Twitter, and Yahoo, and job advertisements show an increasing need for data scientists and big data analysts. Data scientists dig for value in data by analyzing for instance texts, application usage logs, and sensor data. The need for data scientists and big data analysts is apparent in almost every sector in our society, including business, health care, and education.
The Twente Center for Data Science is a collaboration between research groups of the University of Twente to research, promote and facilitate big data analysis for all scientific disciplines. The center operates by the participants sharing their expertise, sharing their contacts, sharing their data, and sharing their research infrastructure (hardware and software) for large-scale data analysis.
The Twente Data Science Center offers a unique combination of expertise in computer science, mathematics, management, behavioral sciences and social sciences; collaborations with leading international companies such as Google, Twitter and Yahoo; and local infrastructure and support for the analysis of very large datasets.
The Norvig Web Data Science Award is organized by Common Crawl and SURFsara for researchers and students in the Benelux. SURFsara provides free access to their Hadoop cluster with a copy of the full Common Crawl web crawl from March 2014 – almost 3 billion web pages. Participants are completely free in choosing their research question. For example, last year's submissions looked at concept association, connections between languages, readability, and more. Be creative and think outside of the box!
The award is named after Peter Norvig, Director of Research at Google, who chairs the jury that will select the winning submission. The contest will run until July 31, 2014. The winning team will be announced at the award ceremony in September 2014 and will get a tablet, smart watch and Github small plan for a year.
Sign up on: http://norvigaward.github.io.
We are very proud that Ravi Kumar from Google agreed to give a keynote speech at the CTIT Symposium on Big Data and the Emergence of Data Science. Kumar, who is well-known for his work on web and data mining and on algorithms for large data sets, has been a senior staff research scientist at Google since June 2012. Prior to this, he was a research staff member at the IBM Almaden Research Center and a principal research scientist at Yahoo! Research. He obtained his Ph.D. in Computer Science from Cornell University in 1998.
Ravi Kumar's talk will cover two non-conventional computational models for analyzing big data. The first is data streams: in this model, data arrives in a stream and the algorithm is tasked with computing a function of the data without explicitly storing it. The second is map-reduce: in this model, data is distributed across many machines and computation is done as a sequence of map and reduce operations. Kumar will present a few algorithms in these models and discuss their scalability.
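The map-reduce model can be illustrated with the classic word-count example. The sketch below is purely illustrative (the helper names are invented, and the input is made up): a real framework such as Hadoop would run the map and reduce phases in parallel on different machines, with the shuffle step handled by the framework between phases.

```python
from collections import defaultdict
from itertools import chain

# Map phase: each "machine" turns its chunk of text into (key, value) pairs.
def map_words(chunk):
    return [(word, 1) for word in chunk.split()]

# Shuffle step: group values by key, as the framework does between phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine all values for a key into a single result.
def reduce_counts(groups):
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input, split into chunks as if distributed over machines.
chunks = ["big data big ideas", "data streams and big data"]
pairs = chain.from_iterable(map_words(c) for c in chunks)
counts = reduce_counts(shuffle(pairs))
```

Note that neither the mapper nor the reducer ever sees the whole dataset at once, which is what makes the model scale across many machines.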
The workshop takes place on Tuesday 4 June at the University of Twente. Other invited speakers at the CTIT symposium are Maarten de Rijke (U. Amsterdam) and Milan Petkovic (Philips).
by Lisa Green (Common Crawl)
We are very excited to announce the winners of the Norvig Web Data Science Award: Lesley Wevers, Oliver Jundt, and Wanno Drijfhout from the University of Twente! The Norvig Web Data Science Award was created by Common Crawl and SURFsara to encourage research in web data science, and is named in honor of distinguished computer scientist Peter Norvig.
There were many excellent submissions that demonstrated how you can extract valuable insight and knowledge from web crawl data. Be sure to check out the work of the winning team, Traitor – Associating Concepts Using The World Wide Web, and the other finalists on the award website. You will find descriptions of the projects as well as links to the code that was used. We hope that these projects will serve as an inspiration for what kind of work can be done with the Common Crawl corpus. All code is open source and we are looking forward to seeing it reused and adapted for other projects.
A huge thank you to our distinguished panel of judges: Peter Norvig, Ricardo Baeza-Yates, Hilary Mason, Jimmy Lin, and Evert Lammerts!
Added on 18 March: Award winners Oliver Jundt, Wanno Drijfhout, and Lesley Wevers with their prize: a high-end Android tablet!