How NLP became a tool of villains

and how to get the genie back into the bottle

Written for the farewell workshop of Franciska de Jong

Franciska de Jong played a deciding role in my life and career. The first time that she – unknowingly maybe until today – influenced my career was around the year 1993. I was a student computer science at the University of Twente. I was busy living the student live as an active member of the “studentenvakbond SRD” (the union for students), but less active studying. After four years, I only managed to pass the courses of year two; I had doubts about computer science, and I thought about quitting. Two courses of Franciska de Jong removed my doubts completely: Computer Science and Language, and Language Technology (in Dutch: Informatica en Taal and Taaltechnologie) .

The courses discussed the important transition in the field of Natural Language Processing (NLP) that happened in the early 1990’s. Statistical approaches were outperforming symbolic approaches in every application of NLP. I thought this was really exciting. In the courses, we discussed symbolic approaches like those based on Noam Chomsky’s hierarchy of grammars, as well as statistical approaches. For instance, we implemented a simple statistical language model that use letter tri-grams, combinations of three letters. Such a model may generate text like the following:

But the men.
He diperse pre sidevider the siscrat andly is ord the thabbas.
Whall the of thise his de forter he the that's hisell prouressay by inlipothe ever the feer.

What language is this? Surprisingly, this very simple statistical model of language is able to fool our minds. It produces utter nonsense, but it easily convinces people that the nonsense is English. I reimplemented the assignment for this presentation and trained it on Moby Dick. This is English, right?

Also, in the course, we discussed ELIZA, a chatbot developed by Joseph Weizenbaum in the 1960’s. ELIZA is inspired by Rogerian psychiatrists, she reflects back the words of the user as a follow-up question. ELIZA is clearly not intelligent, not in anyway, but people that chatted with ELIZA were easily manipulated into believing that ELIZA was intelligent. This to great frustration of Weizenbaum: he said: “Only people who misunderstood ELIZA called it a sensation.”

YOU: Men are all alike.
ELIZA: Can you think of a specific example ?
YOU: Well, my boyfriend made me come here
ELIZA: Is it important to you that your boyfriend made you come here ?

Now, about 25 years later, we are again witnessing a major transition in the field of Natural Language Processing. Again, these are really exciting times. This time, deep neural approaches are consistently outperforming the statistical models of the 90’s in every conceivable application, including chatbots.

Also, this time, people are easily manipulated into thinking that they talk to an intelligent computer, for instance by chatbots like chatGPT. But chatGPT, like ELIZA, is clearly not intelligent. ChatGPT is based on a model that, like the tri-gram model above, produces the most likely sequence of words. Franciska’s courses were an excellent preparation on calling bullshit on the intelligence of chatGPT.

No, chatGPT is not hallucinating: it’s bullshitting.

No, we are not anywhere near to so-called “Artificial General Intelligence”: bullshit!

No, you cannot prepare a court case using chatGPT: What are you DOING?!

Now, you may wonder, if researchers have known for a long time that people are easily manipulated into thinking that machines are intelligent: Why is this again happening, and why on such a large scale? The answer is simple but frightening: Some of us researchers are not very nice people. Some of us researchers are evil. Some of us researchers want to manipulate people. They are like the villains that we know from fiction and cartoons.

Here, you see one of the papers that Franciska gave me when I asked for a master thesis topic: The topic? Statistical Machine Translation. I loved this topic. I considered machine translation a wonderful application that may bring together people from different cultures and backgrounds. I also considered machine translation to be a knowledge intensive application, so it is amazing that a statistical system may learn how to translate by just feeding it lots of text.

One of the authors of this paper, however, decided that he would not pursue machine translation further. Instead he went all in on manipulating people. This person was Robert Mercer.

Robert Mercer was one of the driving forces behind Cambridge Analytica. The company that openly bragged about its ability to influence elections. It cooperated with Facebook to manipulate the Brexit vote, it tried to manipulate the election in the USA in 2020, and it claimed to have manipulated many other elections.

Like villains in fiction, people like Mercer are open about their intentions: They do not only want to make money from manipulating people: They actually believe that most people alive today have not much value. They claim that it is more important to achieve a utopian world in the far future than to solve today’s pressing problems.

Seriously, many of today’s influential techno-optimists are inspired by views like Transhumanism, Effective Altruism, and Longtermism. Timnit Gebru and Émile Torres coined the abbreviation TESCREAL to summarize these views (and some other as well):

  • Transhumanism
  • Extropiansim
  • Singularitarianism
  • Cosmism
  • Rationalism
  • Effective Altruism
  • Longtermism

In these views, achieving artificial general intelligence, or achieving colonies on other planets is the ultimate goal of humanity. Achieving these goals is prioritised over immediate problems such as cutting down on carbon emissions to counter climate change. Also, it is perfectly okay to let workers in Kenya label your data for less than 2 Euro a day (one of the “secrets” behind the success of chatGPT). More on climate disaster and labor exploitation in a minute, but first…

Using language models for IR

… let me go back to my journey with Franciska. In 2001, I defended my PhD thesis supervised by Franciska: “Using Language Models for Information Retrieval”. We may call these statistical language models: small language model today, as opposed to the large language models like GPT. Together with Wessel Kraaij, Arjen de Vries, and Thijs Westerveld, I showed that these models can be implemented using a traditional index. Like the index in the back of a book, such an index lists for each term the pages that contain the term, and it can be used to retrieve documents very efficiently. Web search using small language models therefore takes no more energy than running any other classical search engine.

The system inspects the index, and … done! What does search using large language models look like? Let me show the architecture of one of the most popular BERT rerankers of today. BERT is a transformer-based large language model released by Google in 2018.

First, large language models cannot actually retrieve information, so this approach starts with the same index as above. Then it uses BERT, which consists of 12 layers of transformers where each word or word piece is represented by 768 numbers. Then the system needs an actual reranker, another small neural network.

At ACL 2019 Emma Strubell presented an approach to estimate the energy that is needed to train and use BERT, and at SIGIR 2022, Harry Scells used that approach to estimate the energy needed to use BERT as a reranker and he compared it to the energy needed by the traditional index. It turns our the using a reranker like this takes a staggering 138,000 times more energy than using the index alone. So, for every query that is processed by the BERT large language model ranker, we can process 138,000 queries using the index!

Let me try to give this presentation a positive ending by explaining how to get the genie back into the bottle. I call on researchers to do the following:

  1. Teach about the dark side of AI and NLP: Big corporations are using this technology to manipulate people on a very large scale;
  2. As a researcher, always try simple baselines: Run and optimize a baseline system that uses the index and nothing more. You may still use small language models;
  3. If you use crowd workers, pay them well;
  4. If you do use LARGE language models: Measure energy consumption and estimate carbon emissions.

Finally, In your conclusion: Make a trade-off. If you improved search quality by 20%, but you increased carbon emissions more than 100,000 times … then maybe conclude it is not worth it!