How NLP became a tool of villains

Written for the farewell workshop of prof. Franciska de Jong

Franciska de Jong played a decisive role in my life and career. The first time that she – perhaps unknowingly until this day – influenced my career was around the year 1993. I was a computer science student at the University of Twente. I was busy living the student life as an active member of the “studentenvakbond SRD” (the student union), but less busy studying. After four years, I had only managed to pass the courses of the second year; I had doubts about computer science, and I thought about quitting. Two courses by Franciska de Jong removed my doubts completely: Computer Science and Language, and Language Technology (in Dutch: Informatica en Taal and Taaltechnologie).

The courses discussed the important transition in the field of Natural Language Processing (NLP) that happened in the early 1990s. Statistical approaches were outperforming symbolic approaches in every application of NLP. I thought this was really exciting. In the courses, we discussed symbolic approaches like those based on Noam Chomsky’s hierarchy of grammars, as well as statistical approaches. For instance, we implemented a simple statistical language model that uses letter trigrams, combinations of three letters. Such a model may generate text like the following:

But the men.
He diperse pre sidevider the siscrat andly is ord the thabbas.
Whall the of thise his de forter he the that's hisell prouressay by inlipothe ever the feer.

What language is this? Surprisingly, this very simple statistical model of language is able to fool our minds. It produces utter nonsense, but it easily convinces people that the nonsense is English. I reimplemented the assignment for this presentation and trained it on Moby Dick. This is English, right?
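
For readers who want to try this at home: the assignment boils down to counting which letter follows each pair of letters, then sampling letters from those counts. A minimal sketch in Python (the short training sentence below is a stand-in for the full text of Moby Dick):

```python
import random
from collections import defaultdict

def train_trigrams(text):
    """Count, for each pair of characters, which character follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - 2):
        counts[text[i:i+2]][text[i+2]] += 1
    return counts

def generate(counts, length=100, seed="th"):
    """Sample one character at a time, conditioned on the previous two."""
    out = seed
    for _ in range(length):
        followers = counts.get(out[-2:])
        if not followers:
            break
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out += random.choices(chars, weights=weights)[0]
    return out

# Train on any English text (Moby Dick in the original assignment):
model = train_trigrams("call me ishmael some years ago never mind how long precisely")
print(generate(model, 60))
```

Trained on a whole novel, this produces exactly the kind of convincing nonsense shown above.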

Also in the courses, we discussed ELIZA, a chatbot developed by Joseph Weizenbaum in the 1960s. ELIZA is inspired by Rogerian psychotherapists: she reflects the user’s words back as a follow-up question. ELIZA is clearly not intelligent, not in any way, but people who chatted with ELIZA were easily manipulated into believing that she was. This was to Weizenbaum’s great frustration; he said: “Only people who misunderstood ELIZA called it a sensation.”

YOU: Men are all alike.
ELIZA: Can you think of a specific example ?
YOU: Well, my boyfriend made me come here
ELIZA: Is it important to you that your boyfriend made you come here ?
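
The trick behind ELIZA fits in a few lines: match the input against a list of patterns, swap first-person words for second-person ones, and reflect the result back as a question. A minimal sketch (the three rules below are illustrative; Weizenbaum’s actual DOCTOR script contained many more):

```python
import re

# A tiny subset of ELIZA-style rules: a regex plus a response template
# that reflects the user's own words back as a question.
RULES = [
    (r".*\bmy (.+)", "Is it important to you that your {0}?"),
    (r".*\bi am (.+)", "How long have you been {0}?"),
    (r".*\ball\b.*", "Can you think of a specific example?"),
]

# First-person words are swapped to second person before reflecting.
PRONOUNS = {"i": "you", "me": "you", "my": "your", "am": "are"}

def reflect(phrase):
    return " ".join(PRONOUNS.get(w, w) for w in phrase.split())

def respond(sentence):
    s = sentence.lower().rstrip(".!?")
    for pattern, template in RULES:
        m = re.match(pattern, s)
        if m:
            return template.format(*(reflect(g) for g in m.groups()))
    return "Please go on."

print(respond("Men are all alike."))                    # Can you think of a specific example?
print(respond("Well, my boyfriend made me come here"))  # Is it important to you that your boyfriend made you come here?
```

No understanding anywhere: just pattern matching and pronoun substitution.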

Now, about 25 years later, we are again witnessing a major transition in the field of Natural Language Processing. Again, these are really exciting times. This time, deep neural approaches are consistently outperforming the statistical models of the 1990s in every conceivable application, including chatbots.

Also this time, people are easily manipulated into thinking that they are talking to an intelligent computer, for instance by chatbots like ChatGPT. But ChatGPT, like ELIZA, is clearly not intelligent. ChatGPT is based on a model that, like the trigram model above, produces the most likely sequence of words. Franciska’s courses were an excellent preparation for calling bullshit on the intelligence of ChatGPT.

No, ChatGPT is not hallucinating: it is bullshitting.

No, we are nowhere near so-called “Artificial General Intelligence”: bullshit!

No, you cannot prepare a court case using ChatGPT: What are you DOING?!

Now, you may wonder: if researchers have known for a long time that people are easily manipulated into thinking that machines are intelligent, why is this happening again, and why on such a large scale? The answer is simple but frightening: some of us researchers are not very nice people. Some of us researchers are evil. Some of us researchers want to manipulate people. They are like the villains that we know from fiction and cartoons.

Here, you see one of the papers that Franciska gave me when I asked for a master’s thesis topic. The topic? Statistical Machine Translation. I loved this topic. I considered machine translation a wonderful application that may bring together people from different cultures and backgrounds. I also considered machine translation a knowledge-intensive application, so it is amazing that a statistical system can learn how to translate just by being fed lots of text.

One of the authors of this paper, however, decided that he would not pursue machine translation further. Instead he went all in on manipulating people. This person was Robert Mercer.

Robert Mercer was one of the driving forces behind Cambridge Analytica, the company that openly bragged about its ability to influence elections. It used Facebook data to manipulate the Brexit vote, it worked to manipulate the 2016 election in the USA, and it claimed to have manipulated many other elections.

Like villains in fiction, people like Mercer are open about their intentions: they do not only want to make money from manipulating people, they actually believe that most people alive today have little value. They claim that it is more important to achieve a utopian world in the far future than to solve today’s pressing problems.

Seriously, many of today’s influential techno-optimists are inspired by views like Transhumanism, Effective Altruism, and Longtermism. Timnit Gebru and Émile Torres coined the acronym TESCREAL to summarize these views (and some others as well):

  • Transhumanism
  • Extropianism
  • Singularitarianism
  • Cosmism
  • Rationalism
  • Effective Altruism
  • Longtermism

In these views, achieving artificial general intelligence or colonies on other planets is the ultimate goal of humanity. Achieving these goals is prioritised over immediate problems such as cutting carbon emissions to counter climate change. Also, it is perfectly okay to let workers in Kenya label your data for less than 2 Euro a day (one of the “secrets” behind the success of ChatGPT). More on climate disaster and labor exploitation in a minute, but first…

Using language models for IR

… let me go back to my journey with Franciska. In 2001, I defended my PhD thesis, supervised by Franciska: “Using Language Models for Information Retrieval”. Today, we may call these statistical language models small language models, as opposed to large language models like GPT. Together with Wessel Kraaij, Arjen de Vries, and Thijs Westerveld, I showed that these models can be implemented using a traditional index. Like the index in the back of a book, such an index lists for each term the pages that contain the term, and it can be used to retrieve documents very efficiently. Web search using small language models therefore takes no more energy than running any other classical search engine.
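
The idea can be sketched in a few lines of Python. This is a minimal Boolean version for illustration only; the retrieval models of the thesis score documents by term probabilities rather than requiring all query terms to match:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it,
    like the index in the back of a book."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the documents containing all query terms (conjunctive search)."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "using language models for information retrieval",
        2: "statistical language models",
        3: "deep neural information retrieval"}
index = build_index(docs)
print(search(index, "language models"))  # documents 1 and 2
```

The expensive work (building the index) happens once, offline; answering a query is just a few set lookups.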

The system inspects the index, and … done! What does search using large language models look like? Let me show the architecture of one of the most popular BERT rerankers of today. BERT is a transformer-based large language model released by Google in 2018.

First, large language models cannot actually retrieve information, so this approach starts with the same index as above. Then it uses BERT, which consists of 12 layers of transformers in which each word or word piece is represented by 768 numbers. On top of that, the system needs an actual reranker: another, smaller neural network.
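
The two-stage architecture can be sketched as follows. The scoring function below is a cheap stand-in for BERT, which in reality pushes every (query, document) pair through all 12 transformer layers; everything here is illustrative, not the actual reranker implementation:

```python
def retrieve(index, query, k=1000):
    """Stage 1: cheap candidate retrieval from the inverted index."""
    scores = {}
    for term in query.lower().split():
        for doc_id in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1  # crude term-overlap score
    return sorted(scores, key=scores.get, reverse=True)[:k]

def rerank(query, candidates, docs, score_fn):
    """Stage 2: an expensive model re-scores every single candidate."""
    return sorted(candidates, key=lambda d: score_fn(query, docs[d]), reverse=True)

def fake_bert_score(query, doc):
    """Stand-in for a BERT cross-encoder: here just Jaccard word overlap."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

docs = {1: "using language models for retrieval",
        2: "neural models",
        3: "language technology"}
index = {"using": [1], "language": [1, 3], "models": [1, 2],
         "for": [1], "retrieval": [1], "neural": [2], "technology": [3]}

candidates = retrieve(index, "language models")
print(rerank("language models", candidates, docs, fake_bert_score))  # [1, 3, 2]
```

Note that stage 2 runs the expensive model once per candidate, per query; that is where the energy goes.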

At ACL 2019, Emma Strubell presented an approach to estimate the energy that is needed to train and use BERT, and at SIGIR 2022, Harry Scells used that approach to estimate the energy needed to use BERT as a reranker, comparing it to the energy needed by the traditional index. It turns out that using a reranker like this takes a staggering 138,000 times more energy than using the index alone. So, for every query that is processed by the BERT large language model reranker, we can process 138,000 queries using the index!

Let me try to give this presentation a positive ending by explaining how to get the genie back into the bottle. I call on researchers to do the following:

  1. Teach about the dark side of AI and NLP: Big corporations are using this technology to manipulate people on a very large scale;
  2. As a researcher, always try simple baselines: Run and optimize a baseline system that uses the index and nothing more. You may still use small language models;
  3. If you use crowd workers, pay them well;
  4. If you do use LARGE language models: Measure energy consumption and estimate carbon emissions.
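
Point 4 need not be complicated. Once you have measured (or estimated) the energy per query, converting it to a carbon estimate is a single multiplication by your grid’s carbon intensity. The per-query figures below are made up for illustration; only the factor of 138,000 comes from the SIGIR 2022 study:

```python
def carbon_grams(energy_kwh, grid_intensity=400):
    """Estimated emissions in grams of CO2, given energy in kWh and a grid
    carbon intensity in gCO2/kWh (400 is illustrative; look up the real
    figure for your local grid)."""
    return energy_kwh * grid_intensity

# Hypothetical energy per query for a plain inverted-index search engine:
index_kwh = 1e-7
# The BERT reranker uses 138,000 times more energy per query:
bert_kwh = index_kwh * 138_000

print(carbon_grams(index_kwh))  # ≈ 4e-05 grams CO2 per query
print(carbon_grams(bert_kwh))   # ≈ 5.5 grams CO2 per query
```

Even rough numbers like these make the trade-off in your conclusions concrete instead of hand-wavy.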

Finally, in your conclusions, make a trade-off: if you improved search quality by 20%, but you increased carbon emissions more than 100,000 times … then maybe conclude it is not worth it!

WOWS2024: Workshop on Open Web Search

Co-located with ECIR 2024 in Glasgow on 28 March 2024

The First International Workshop on Open Web Search (WOWS) aims to promote and discuss ideas and approaches to open up the web search ecosystem so that small research groups and young startups can leverage the web to foster an open and diverse search market. Therefore, the workshop has two calls that support collaborative and open web search engines:

  1. for scientific contributions, and
  2. for open-source implementations.

The first call aims for scientific contributions to building collaborative search engines, including collaborative crawling, collaborative search engine deployment, collaborative search engine evaluation, and collaborative use of the web as a resource for researchers and innovators. The second call aims to gather open-source prototypes and gain practical experience with collaborative, cooperative evaluation of search engines and their components using the TIREx Information Retrieval Evaluation Platform.

Important Dates

  • January 24, 2024 (optional): Early Bird Submissions of Software and Papers. You receive early notifications; Accepted contributions get a free WOWS T-Shirt
  • February 14, 2024: Deadline Submissions of Software and Papers
  • March 13, 2024: Peer review notification
  • March 20, 2024: Camera-ready papers submission
  • March 28, 2024: Workshop (co-located with ECIR 2024 in Glasgow)

More information at:

Challenges of index exchange for search engine interoperability

by Djoerd Hiemstra, Gijs Hendriksen, Chris Kamphuis, and Arjen de Vries

We discuss tokenization challenges that arise when sharing inverted file indexes to support interoperability between search engines, in particular: How to tokenize queries such that the tokens are consistent with the tokens in the shared index? We discuss various solutions and present preliminary experimental results that show when the problem occurs and how it can be mitigated by standardizing on a simple, generic tokenizer for all shared indexes.
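
A toy example of the problem (the two tokenizers below are illustrative, not the ones studied in the paper): if the engine that built the shared index and the engine that tokenizes the query disagree on tokenization, query terms silently fail to match indexed tokens.

```python
import re

def tokenizer_a(text):
    """Tokenizer used to build the shared index: whitespace splitting only."""
    return text.lower().split()

def tokenizer_b(text):
    """Tokenizer used on the query side: also strips punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

doc = "State-of-the-art search!"
print(tokenizer_a(doc))  # ['state-of-the-art', 'search!']
print(tokenizer_b(doc))  # ['state', 'of', 'the', 'art', 'search']

# The query token 'search' never matches the indexed token 'search!':
assert "search" in tokenizer_b(doc)
assert "search" not in tokenizer_a(doc)
```

Standardizing on one simple, generic tokenizer for all shared indexes removes this mismatch.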

To be presented at the 5th International Open Search Symposium #OSSYM2023 at CERN, Geneva, Switzerland on 4-6 October 2023

[download pdf]

Impact and development of an Open Web Index for open web search

by Michael Granitzer, Stefan Voigt, Noor Afshan Fathima, Martin Golasowski, Christian Guetl, Tobias Hecking, Gijs Hendriksen, Djoerd Hiemstra, Jan Martinovič, Jelena Mitrović, Izidor Mlakar, Stavros Moiras, Alexander Nussbaumer, Per Öster, Martin Potthast, Marjana Senčar Srdič, Sharikadze Megi, Kateřina Slaninová, Benno Stein, Arjen P. de Vries, Vít Vondrák, Andreas Wagner, Saber Zerhoudi

Web search is a crucial technology for the digital economy. Dominated by a few gatekeepers focused on commercial success, however, web publishers have to optimize their content for these gatekeepers, resulting in a closed ecosystem of search engines as well as the risk of publishers sacrificing quality. To encourage an open search ecosystem and offer users genuine choice among alternative search engines, we propose the development of an Open Web Index (OWI). We outline six core principles for developing and maintaining an open index, based on open data principles, legal compliance, and collaborative technology development. The combination of an open index with what we call declarative search engines will facilitate the development of vertical search engines and innovative web data products (including, e.g., large language models), enabling a fair and open information space. This framework underpins the EU-funded project OpenWebSearch.EU, marking the first step towards realizing an Open Web Index.

Published in the Journal of the Association for Information Science and Technology (JASIST)

[download pdf]

Fausto de Lang graduates on tokenization for information retrieval

An empirical study of the effect of vocabulary size for various tokenization strategies in passage retrieval performance.

by Fausto de Lang

Many interactions between the fields of lexical retrieval and large language models remain underexplored; in particular, there is little research into the use of advanced language-model tokenizers in combination with classical information retrieval mechanisms. This research looks into the effect of vocabulary size for various tokenization strategies on passage retrieval performance. It also provides an overview of the impact of the WordPiece, Byte-Pair Encoding, and Unigram tokenization techniques on the MSMARCO passage retrieval task. These techniques are explored both in re-trained tokenizers and in tokenizers trained from scratch. Based on three metrics, this research found that WordPiece is the best-performing tokenization technique on the MSMARCO passage retrieval task. It also found that a training vocabulary size of around 10,000 tokens is best with regard to Recall, while around 320,000 tokens shows the optimal Mean Reciprocal Rank and Normalized Discounted Cumulative Gain scores. Most importantly, the optimum at a relatively small vocabulary size suggests that shorter subwords can benefit the indexing and searching process (up to a certain point). This is a meaningful result, since it means that many applications where (re-)trained tokenizers are used for information retrieval might be improved by tweaking the vocabulary size during training. This research has mainly focused on building a bridge between (re-)trainable tokenizers and information retrieval software, while reporting on interesting tunable parameters. Finally, this research recommends that researchers build their own tokenizer from scratch, since it forces one to look at the configuration of the underlying processing steps.
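
To see how the vocabulary size steers subword length, here is a toy version of Byte-Pair Encoding (a didactic sketch, not the implementation used in the thesis): each merge adds one token to the vocabulary, so a larger merge budget produces longer subwords.

```python
from collections import Counter

def merge_word(symbols, pair):
    """Replace every occurrence of the adjacent pair by its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent pair.
    The vocabulary grows by one token per merge."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [merge_word(w, best) for w in corpus]
    return merges, corpus

merges, corpus = train_bpe(["lower", "lowest", "low"], 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(corpus)  # [['low', 'e', 'r'], ['low', 'e', 's', 't'], ['low']]
```

With only two merges, the vocabulary already contains the subword “low”; many more merges would eventually yield whole words, which is exactly the trade-off the thesis tunes.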

Defended on 27 June 2023

Git repository at:

UNFair: Search Engine Manipulation, Undetectable by Amortized Inequity

by Tim de Jonge and Djoerd Hiemstra

Modern society increasingly relies on Information Retrieval systems to answer various information needs. Since this impacts society in many ways, there has been a great deal of work to ensure the fairness of these systems, and to prevent societal harms. There is a prevalent risk of failing to model the entire system, where nefarious actors can produce harm outside the scope of fairness metrics. We demonstrate the practical possibility of this risk through UNFair, a ranking system that achieves performance and measured fairness competitive with current state-of-the-art, while simultaneously being manipulative in setup. UNFair demonstrates how adhering to a fairness metric, Amortized Equity, can be insufficient to prevent Search Engine Manipulation. This possibility of manipulation bypassing a fairness metric discourages imposing a fairness metric ahead of time, and motivates instead a more holistic approach to fairness assessments.

To be presented at the ACM Conference on Fairness, Accountability, and Transparency (FAccT 2023) on 12-15 June in Chicago, USA.

[download pdf]

Cross-Market Product-Related Question Answering

by Negin Ghasemi, Mohammad Aliannejadi, Hamed Bonab, Evangelos Kanoulas, Arjen de Vries, James Allan, and Djoerd Hiemstra

Online shops such as Amazon, eBay, and Etsy continue to expand their presence in multiple countries, creating new resource-scarce marketplaces with thousands of items. We consider a marketplace to be resource-scarce when only limited user-generated data is available about the products (e.g., ratings, reviews, and product-related questions). In such a marketplace, an information retrieval system is less likely to help users find answers to their questions about the products. As a result, questions posted online may go unanswered for extended periods. This study investigates the impact of using available data in a resource-rich marketplace to answer new questions in a resource-scarce marketplace, a new problem we call cross-market question answering. To study this problem’s potential impact, we collect and annotate a new dataset, XMarket-QA, from Amazon’s UK (resource-scarce) and US (resource-rich) local marketplaces. We conduct a data analysis to understand the scope of the cross-market question-answering task. This analysis shows a temporal gap of almost one year between the first question answered in the UK marketplace and the US marketplace. Also, it shows that the first question about a product is posted in the UK marketplace only when 28 questions, on average, have already been answered about the same product in the US marketplace. Human annotations demonstrate that, on average, 65% of the questions in the UK marketplace can be answered within the US marketplace, supporting the concept of cross-market question answering. Inspired by these findings, we develop a new method, CMJim, which utilizes product similarities across marketplaces in the training phase for retrieving answers from the resource-rich marketplace that can be used to answer a question in the resource-scarce marketplace. Our evaluations show CMJim’s significant improvement compared to competitive baselines.

To be presented at the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023) on July 23-27 in Taipei, Taiwan.

[download pdf]

#OSSYM2023 at CERN

The Open Search Symposium #OSSYM2023 brings together the Open Internet Search community in Europe for the fifth time this year. The interactive conference provides a forum to discuss and further develop the ideas and concepts of open internet search. Participants include researchers, data centres, libraries, policy makers, legal and ethical experts, and society.

#OSSYM2023 takes place at CERN, Geneva, Switzerland on 4-6 October 2023 organized by the Open Search Foundation. The Call for Papers ends 31 May 2023.

More info at:

Open Web Search project kicked off

Today, we kick off our new EU project. In the project, we develop a new architecture for search engines in which many parts of the system are decentralized. The key idea is to separate index construction from the search engines themselves: the most expensive step, creating index shards, can be carried out on large clusters, while the search engine itself can be operated locally.

We also envision an Open-Web-Search Engine Hub, where companies and individuals can share their specifications of search engines and pre-computed, regularly updated search indices. We think of this as a search engine mash-up that would enable a new future of human-centric search without privacy concerns.

More information at: