Teaching – Djoerd Hiemstra

Current courses

Open projects in OpenWebSearch.eu

Please contact me for open Research Internships, BSc thesis and MSc thesis projects.

Language identification for minority languages. Extend the dataset from Laurie Burchell et al. with languages like Gascon (see keynote of Josiane Mothe at ECIR2024). Alternatively, design and implement a two-stage language identification method, stage one for language groups using stop words only, and a second stage using all words.
Machine learning of generic, language-agnostic tokenizers: Study an optimal word-pieces algorithm for information retrieval. Can we improve search by moving beyond word boundaries? Study and evaluate query segmentation algorithms (see: Webis Query Segmentation Corpus).
Federated crawling for the Open Search Foundation: There has been a lot of work on distributed crawling, where there is full cooperation between (geographically) distributed crawlers and all nodes use the same central crawling policy. In federated crawling there is no such central policy; each participant decides what to crawl. A node may refuse to crawl a URL; there may be overlap in pages crawled by different nodes; a node may follow some nodes or block other nodes. What would be the effects of such requirements? In this project, you implement a federated crawler or simulate a crawler using a large existing web crawl such as CommonCrawl.
“Fat head”, “best of the web” web index. Create a web index that contains the “essential” web, the web that answers the most common queries, using query statistics from SEO companies like ahrefs. What is the trade-off? What is the smallest web index that answers most queries? What is the smallest index that answers the top 10% of queries? etc.
The golden web evaluation set: Using the results from the frequency-list project or the query statistics from the project above, download search engine result pages from popular search engines to create a dataset that can be used to evaluate new search engines by checking if they retrieve the same top 10 as Google/Bing/Yandex/Baidu.
Web page quality ranker: Inspired by the Waterloo Spam Rankings, create a classifier/ranker that assigns a quality score to any web page.
Parameterized distributions for static ranking. Replace a static rank score (like PageRank) by a simple formula (a parameterized distribution), so a search engine does not need to store all static scores and merge them at query time.
Data portability with CIFF for non-western languages. What problems arise if we use CIFF (the Common Index File Format) for languages for which text processing is non-trivial, like Chinese, for which word boundaries are implicit, or for highly inflectional languages like Arabic? Can CIFF support multiple, different text processors for multilingual collections?
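
The two-stage language identification idea from the first project above can be sketched as follows. This is only a toy illustration, not a proposed solution: the word lists, the group names, and the functions `best_match` and `identify` are all hypothetical, and a real system would learn its vocabularies from a dataset rather than hard-code them.

```python
# Hypothetical toy data: stop words per language group (stage one)
# and a small vocabulary per language within each group (stage two).
GROUP_STOP_WORDS = {
    "romance": {"de", "la", "le", "el", "que", "et", "y", "los"},
    "germanic": {"the", "de", "het", "en", "and", "of", "der", "und"},
}
LANGUAGE_WORDS = {
    "romance": {
        "french": {"le", "et", "maison", "chien", "est", "dans", "la"},
        "spanish": {"el", "y", "casa", "perro", "está", "en", "la"},
    },
    "germanic": {
        "english": {"the", "and", "house", "dog", "is", "in", "of"},
        "dutch": {"het", "en", "huis", "hond", "is", "in", "de"},
    },
}


def best_match(tokens, candidates):
    """Return the candidate whose word set matches the most tokens.

    Ties are broken alphabetically so the result is deterministic.
    """
    def hits(name):
        return sum(1 for token in tokens if token in candidates[name])

    return max(sorted(candidates, reverse=True), key=hits) if False else \
        min(sorted(candidates), key=lambda name: -hits(name))


def identify(text):
    """Stage one picks a language group using stop words only;
    stage two picks a language within that group using all words."""
    tokens = text.lower().split()
    group = best_match(tokens, GROUP_STOP_WORDS)
    language = best_match(tokens, LANGUAGE_WORDS[group])
    return group, language


if __name__ == "__main__":
    for sentence in ["the dog is in the house",
                     "het huis en de hond",
                     "le chien est dans la maison"]:
        print(sentence, "->", identify(sentence))
```

The point of the split is that stage one needs only a small, cheap resource (stop words), while the larger per-language vocabularies are consulted only within the selected group.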

Other projects (some of these are already taken)

Comparison of fairness measures for search: Learning-to-rank methods for search engines (machine learning for search) optimize for clicks and may therefore produce biased results, or results that unfairly amplify clickbait and hate speech. In this research, you develop methods for measuring the fairness of results and compare existing methods, for instance on simulated data.
A bittorrent-based distributed file system: Design and evaluate a file system (inspired by for instance the Google file system) that uses bittorrent/webtorrent to distribute blocks of data over multiple machines and makes sure the replication level is sufficiently high such that no data is lost and the file system remains (eventually) consistent.
Federated Search (Data Science/Software Science):
Research approaches that combine the results from multiple, independent and non-cooperative (in the sense that they do not share their index) search engines.
Federated Learning (Data Science):
Research approaches that divide machine learning over multiple independent and private data sources.
- Federated Learning-to-Rank: Learn a (personalized) search engine (re-)ranker that never leaves the user’s device.
Federated Social networks (Data Science/Software Science):
- Ephemeral Social networking (Software Science):
  Based on the W3C standard ActivityPub, design an ephemeral social network (in which most posts are removed after some time) and compare its network/storage/memory/CPU load with that of durable solutions like Mastodon.
- Secure federated communication (Digital Security):
  Design/adapt an end-to-end encrypted solution for ActivityPub-based social networking: How to handle multiple devices and heterogeneous networks?
- Transitioning the RU to self-hosted, federated solutions (Information Sciences): For services such as self-hosted web analytics, social networking, or video streaming: What are the user requirements? What solutions meet these requirements? What are additional benefits (for instance more autonomy for employees and students)? How to show this with a proof-of-concept? (more info)
With Nedap Healthcare, Groenlo Machine Learning and Natural Language Processing:
Clinical Natural Language Processing / De-identification of medical records.
With RUMC, Nedap Healthcare and Leiden University: MSc thesis project on Generating synthetic clinical data for shared Machine Learning tasks.
Bias in Machine Learning: evaluating spam filters for bias (see: Spam filters are efficient and uncontroversial. Until you look at them)
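
As a starting point for the Federated Search topic above, the sketch below merges ranked lists from several engines that expose only their result order, not their index or scores. It uses reciprocal rank fusion, a well-known merging heuristic chosen here purely for illustration; the function name, the constant `k=60`, and the example document identifiers are assumptions, not part of any project description.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists using reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total score,
    with ties broken by document identifier.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


if __name__ == "__main__":
    # Hypothetical result lists from three independent engines.
    engine_a = ["a", "b", "c"]
    engine_b = ["b", "c", "a"]
    engine_c = ["b", "a", "d"]
    print(reciprocal_rank_fusion([engine_a, engine_b, engine_c]))
```

Because only ranks are used, this kind of merge works even when the engines are non-cooperative and their internal scores are not comparable.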

Past courses

Research Experiments in Databases and Information Retrieval (REDI): Master Course Computer Science (2010-2013, 2015, 2017) [course syllabus]
Foundations of Information Retrieval: Master Course Computer Science & Human Media Interaction (2016-2018; joint effort with Theo Huibers and Dolf Trieschnigg) [course info]
Advanced Research Projects in Information Retrieval: Master Course Computer Science & Human Media Interaction (2016-2018; joint effort with Theo Huibers and Dolf Trieschnigg) [course info]
Search Engine Technology: CuriousU Summer School course (2017, joint effort with Dolf Trieschnigg) [course info]
Data Science: M.Sc. Course Computer Science & Business Information Technology [course info] (2013-2017)
Information Retrieval: Master Course Computer Science & Human Media Interaction (joint effort with Theo Huibers and Dolf Trieschnigg) [course info] (2004-2015)
Web Science: B.Sc. Course. High Tech Human Touch [course info], 2015
Managing Big Data: Master Course Computer Science (joint effort with Robin Aly) [course info] (2012-2015).
Advanced Database Systems: Graduate Course Computer Science [course info] (2013-2015).
Databases: Undergraduate course Computer Science [course info] (2013-2015).
Data & Information: Undergraduate Module [course info]
XML & Databases 1: Master Course Computer Science (joint effort with Maurice van Keulen) [course info]
Distributed Data Processing using MapReduce: Master Course Computer Science (joint effort with Maarten Fokkinga) [course info], 2009-2011
Advanced Database Systems: Master Course Computer Science [vist], 2001-2009
XML & Databases 2: Master Course Computer Science (joint effort with Maurice van Keulen) [vist], 2003-2009
Study advisor for the Computer Science master track Information Systems Engineering, 2004 – 2007
Bijzondere Onderwerpen Gegevensbanken (“Capita Selecta Databases”), 2001 – 2003
Gedistribueerde databases en middlewaretechnologie, 2001 – 2003
Informatica en Taal (“Computer Science and Language”), 2001
Lectures “Introduction to Multimedia”, “Multimedia Databases”, and “Multimedia Retrieval” in the course Advanced distributed multimedia database systems 1, 2004
Lectures “Data Models” and “Storage and Access Methods” in Databasetoepassingen (“Database Applications”), 2002
Lectures “Object-Oriented Databases” in Gegevensbanken (“Databases”), 2002
Lectures “Information Retrieval”, “Markov models” and “Part-of-Speech tagging” in the course Taaltechnologie (“Language Technology”), 1997 – 2004
Lectures “Language Technology for Information Retrieval” in the course Information Retrieval en Kennisbeheer, 1999 – 2003
