Tom Rust graduates on Learned Sparse Retrieval

by Tom Rust

Machine learning algorithms are achieving better results each day and are gaining popularity. The top-performing models are usually deep learning models. These models can absorb vast amounts of training data, improving prediction results. Unfortunately, these models consume a large amount of energy, which is something that not everyone is aware of. In information retrieval, large language models are used to provide extra context to queries and documents. Since information retrieval systems typically have large datasets, a suitable deep learning model must be chosen to find a balance between accuracy and energy usage. Learned sparse retrieval models are an example of these deep learning models. These models work by expanding all documents to create the optimal document representation that allows this document to be found correctly. This step is done before creating the inverted index, allowing for conventional ranking methods such as BM25. With this research, we compare different learned sparse retrieval models in terms of accuracy, speed, size and energy usage. We also compare them with a full-text index. We see that on MS Marco, the learned sparse retrievers outperform the full-text index on all popular evaluation benchmarks. However, the learned sparse retrievers can consume up to 100 times more energy whilst creating the index, which then has a higher query latency, and it uses more disk space. For WT10g we see that the full-text index gives us the highest accuracies whilst also being more energy efficient, using less disk space and having a lower query latency.
We conclude that learned sparse retrieval has the potential to improve accuracy on certain datasets, but a trade-off is necessary between the improved accuracy and the cost of increased storage, latency, and energy consumption.

Proceedings of WOWS 2024

The Proceedings of the first Workshop on Open Web Search (WOWS), which took place on 28 March 2024 in Glasgow, UK, are now published in the CEUR Workshop Series as Volume 3689.

WOWS 2024 had two calls for contributions. The first call targets scientific contributions on cooperative search engine development. This includes cooperative crawling of the web and cooperative deployment and evaluation of search engines. We specifically highlight the potential of enabling public and commercial organizations to use an indexed web crawl as a resource to create innovative search engines tailored to specific user groups, instead of relying on one search engine provider. The second call aims at gaining practical experience with joint, cooperative evaluation of search engine prototypes and their components using the Information Retrieval Experiment Platform TIREx. The workshop involved a keynote by Negar Arabzadeh from the University of Waterloo, 8 paper presentations (5 full papers and 3 short papers accepted out of 13 submissions), and a breakout session with participant discussions. WOWS received funding from the European Union’s Horizon Europe research and innovation program under grant agreement No 101070014. We would like to thank the Program Committee members for helpful reviews and suggestions to improve the contributions to the workshop. Special thanks go to Christine Plote, Managing Director of the Open Search Foundation, for the WOWS 2024 website.

https://ceur-ws.org/Vol-3689/

[download pdf]

Semere Bitew defends PhD thesis on Language Models for Education

Language Model Adaptation with Applications in AI for Education

by Semere Kiros Bitew

The overall theme of my dissertation is the adaptation of language models, mainly for applications in AI in education, to automatically create educational content. It addresses the challenges in formulating test and exercise questions in educational settings, which traditionally require significant training, experience, time, and resources. This is particularly critical in high-stakes environments like certifications and tests, where questions cannot be reused. In particular, the primary research is focused on two educational tasks: distractor generation and gap-filling exercise generation. The distractor generation task refers to generating plausible but incorrect answers in multiple-choice questions, while gap-filling exercise generation refers to inducing well-chosen gaps to generate grammar exercises from existing texts. These tasks, although extensively researched, present unexplored avenues that recent advancements in language models can address. As a secondary objective, I explore the adaptation of coreference resolution to new languages. Coreference resolution is a key NLP task that involves clustering mentions in a text that refer to the same real-world entities, a process vital for understanding and generating coherent language.

The Open Web Index

Crawling and Indexing the Web for Public Use

by Gijs Hendriksen, Michael Dinzinger, Sheikh Mastura Farzana, Noor Afshan Fathima, Maik Fröbe, Sebastian Schmidt, Saber Zerhoudi, Michael Granitzer, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, and Benno Stein

Only a few search engines index the Web at scale. Third parties who want to develop downstream applications based on web search fully depend on the terms and conditions of the few vendors. The public availability of the large-scale Common Crawl does not alleviate the situation, as it is often cheaper to crawl and index only a smaller collection focused on a downstream application scenario than to build and maintain an index for a general collection the size of the Common Crawl. Our goal is to improve this situation by developing the Open Web Index. The Open Web Index is a publicly funded basic infrastructure from which downstream applications will be able to select and compile custom indexes in a simple and transparent way. Our goal is to establish the Open Web Index along with associated data products as a new open web information intermediary. In this paper, we present our first prototype for the Open Web Index and our plans for future developments. In addition to the conceptual and technical background, we discuss how the information retrieval community can benefit from and contribute to the Open Web Index – for example, by providing resources, by providing pre-processing components and pipelines, or by creating new kinds of vertical search engines and test collections.

To be presented at the European Conference on Information Retrieval (ECIR 2024) in Glasgow on 24-28 March.

[download pdf]

Weighted AUReC

Handling Skew in Shard Map Quality Estimation for Selective Search

by Gijs Hendriksen, Djoerd Hiemstra, and Arjen de Vries

In selective search, a document collection is partitioned into a collection of topical index shards. To efficiently estimate the topical coherence (or quality) of a shard map, the AUReC (Area Under Recall Curve) measure was introduced. AUReC makes the assumption that shards are of similar sizes, one that is violated in practice, even for unsupervised approaches. The problem might be amplified if supervised labelling approaches with skewed class distributions are used. To estimate the quality of such unbalanced shard maps, we introduce a weighted adaptation of the AUReC measure, and empirically evaluate its effectiveness using the ClueWeb09B and Gov2 datasets. We show that it closely matches the evaluations of the original AUReC when shards are similar in size, but better captures the differences in performance when shard sizes are skewed.

To be presented at the European Conference on Information Retrieval (ECIR) in Glasgow on 24-28 March.

[download pdf]

Inaugural lecture on 1 March

Invitation

On 1 March 2024 at 15:45h., I will give my inaugural lecture: “Zoekmachines: Samen en duurzaam vooruit” (in Dutch). Everyone is invited. Please register on: https://www.ru.nl/rede/hiemstra

In the lecture, I will share an ancient wisdom about working together; I will discuss my plan to teach students of all backgrounds their shared history; and I will reveal my dream to provide unrestricted access to all human information by working together. The lecture will contain cars, iPhone chargers, the Space Shuttle, and references to exciting recent research.

[download pdf]

How NLP became a tool of villains

and how to get the genie back into the bottle

Written for the farewell workshop of Franciska de Jong

Franciska de Jong played a decisive role in my life and career. The first time that she – unknowingly maybe until today – influenced my career was around the year 1993. I was a computer science student at the University of Twente. I was busy living the student life as an active member of the “studentenvakbond SRD” (the union for students), but less active in studying. After four years, I had only managed to pass the courses of year two; I had doubts about computer science, and I thought about quitting. Two courses of Franciska de Jong removed my doubts completely: Computer Science and Language, and Language Technology (in Dutch: Informatica en Taal and Taaltechnologie).

The courses discussed the important transition in the field of Natural Language Processing (NLP) that happened in the early 1990s. Statistical approaches were outperforming symbolic approaches in every application of NLP. I thought this was really exciting. In the courses, we discussed symbolic approaches like those based on Noam Chomsky’s hierarchy of grammars, as well as statistical approaches. For instance, we implemented a simple statistical language model that uses letter tri-grams, combinations of three letters. Such a model may generate text like the following:

But the men.
He diperse pre sidevider the siscrat andly is ord the thabbas.
Whall the of thise his de forter he the that's hisell prouressay by inlipothe ever the feer.

What language is this? Surprisingly, this very simple statistical model of language is able to fool our minds. It produces utter nonsense, but it easily convinces people that the nonsense is English. I reimplemented the assignment for this presentation and trained it on Moby Dick. This is English, right?
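A letter tri-gram model of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the original course assignment: the function names `train_trigram_model` and `generate` are hypothetical, and the short training string stands in for a real corpus such as the full text of Moby Dick.

```python
import random
from collections import Counter, defaultdict


def train_trigram_model(text):
    """For every pair of letters, count which letter follows it."""
    counts = defaultdict(Counter)
    for i in range(len(text) - 2):
        context, following = text[i:i + 2], text[i + 2]
        counts[context][following] += 1
    return counts


def generate(counts, length=200, seed=None):
    """Sample text letter by letter, conditioned on the previous two letters.

    Assumes the model was trained on at least three characters.
    """
    rng = random.Random(seed)
    contexts = sorted(counts)
    context = rng.choice(contexts)
    out = list(context)
    while len(out) < length:
        followers = counts.get(context)
        if not followers:
            # Dead end (context only seen at the very end of the text):
            # restart from a random known context.
            context = rng.choice(contexts)
            out.extend(context)
            continue
        letters = list(followers)
        weights = [followers[letter] for letter in letters]
        following = rng.choices(letters, weights=weights, k=1)[0]
        out.append(following)
        context = context[1] + following
    return "".join(out[:length])


if __name__ == "__main__":
    # Hypothetical stand-in corpus; a real run would read a whole book.
    corpus = "call me ishmael. some years ago, never mind how long precisely, " * 3
    model = train_trigram_model(corpus)
    print(generate(model, length=80, seed=1))
```

Trained on a full book instead of one sentence, a model like this produces English-looking nonsense of the kind shown above, because it only ever reproduces letter combinations that are frequent in English.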

Also, in the course, we discussed ELIZA, a chatbot developed by Joseph Weizenbaum in the 1960s. ELIZA is inspired by Rogerian psychiatrists: she reflects back the words of the user as a follow-up question. ELIZA is clearly not intelligent, not in any way, but people who chatted with ELIZA were easily manipulated into believing that ELIZA was intelligent, much to the frustration of Weizenbaum, who said: “Only people who misunderstood ELIZA called it a sensation.”

YOU: Men are all alike.
ELIZA: Can you think of a specific example ?
YOU: Well, my boyfriend made me come here
ELIZA: Is it important to you that your boyfriend made you come here ?
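The reflection trick behind an exchange like the one above fits in a handful of pattern rules. The sketch below is an assumption-laden toy, not Weizenbaum's original script: the rule list, the `REFLECTIONS` table and the function names are made up for illustration, and only cover the two turns shown.

```python
import re

# Swap first- and second-person words so the reply addresses the user.
REFLECTIONS = {
    "i": "you", "me": "you", "my": "your",
    "am": "are", "you": "I", "your": "my",
}

# (pattern, reply template) pairs, tried in order.
# These few rules are hypothetical and far smaller than the real ELIZA script.
RULES = [
    (re.compile(r".*\ball alike\b.*", re.I),
     "Can you think of a specific example ?"),
    (re.compile(r"(?:well,\s*)?(my .*)", re.I),
     "Is it important to you that {0} ?"),
    (re.compile(r"i feel (.*)", re.I),
     "Why do you feel {0} ?"),
]


def reflect(fragment):
    """Rewrite a fragment of the user's words from the chatbot's point of view."""
    words = fragment.rstrip(".!?").split()
    return " ".join(REFLECTIONS.get(word.lower(), word) for word in words)


def respond(utterance):
    """Return the reply of the first rule that matches, or a stock phrase."""
    for pattern, template in RULES:
        match = pattern.match(utterance.strip())
        if match:
            return template.format(*(reflect(g) for g in match.groups()))
    return "Please go on."


if __name__ == "__main__":
    print(respond("Men are all alike."))
    print(respond("Well, my boyfriend made me come here"))
```

Nothing in this code models meaning: it matches strings and swaps pronouns, which is exactly why the apparent understanding is an illusion.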

Now, about 25 years later, we are again witnessing a major transition in the field of Natural Language Processing. Again, these are really exciting times. This time, deep neural approaches are consistently outperforming the statistical models of the 90s in every conceivable application, including chatbots.

Also, this time, people are easily manipulated into thinking that they talk to an intelligent computer, for instance by chatbots like ChatGPT. But ChatGPT, like ELIZA, is clearly not intelligent. ChatGPT is based on a model that, like the tri-gram model above, produces the most likely sequence of words. Franciska’s courses were an excellent preparation for calling bullshit on the intelligence of ChatGPT.

https://www.nytimes.com/2023/05/01/business/ai-chatbots-hallucination.html

No, ChatGPT is not hallucinating: it’s bullshitting.

https://www.cnn.com/2023/10/04/tech/japan-softbank-ai-hnk-intl/

No, we are not anywhere near so-called “Artificial General Intelligence”: bullshit!

https://www.forbes.com/sites/mollybohannon/2023/06/08/lawyer-used-chatgpt-in-court-and-cited-fake-cases-a-judge-is-considering-sanctions/

No, you cannot prepare a court case using ChatGPT: What are you DOING?!

Now, you may wonder: if researchers have known for a long time that people are easily manipulated into thinking that machines are intelligent, why is this happening again, and why on such a large scale? The answer is simple but frightening: Some of us researchers are not very nice people. Some of us researchers are evil. Some of us researchers want to manipulate people. They are like the villains that we know from fiction and cartoons.

https://aclanthology.org/C88-1016.pdf

Here, you see one of the papers that Franciska gave me when I asked for a master’s thesis topic. The topic? Statistical Machine Translation. I loved this topic. I considered machine translation a wonderful application that may bring together people from different cultures and backgrounds. I also considered machine translation to be a knowledge-intensive application, so it is amazing that a statistical system may learn how to translate by just feeding it lots of text.

One of the authors of this paper, however, decided that he would not pursue machine translation further. Instead he went all in on manipulating people. This person was Robert Mercer.

https://www.nytimes.com/2018/04/10/us/politics/mercer-family-cambridge-analytica.html

Robert Mercer was one of the driving forces behind Cambridge Analytica, the company that openly bragged about its ability to influence elections. It cooperated with Facebook to manipulate the Brexit vote, it tried to manipulate the 2016 election in the USA, and it claimed to have manipulated many other elections.

Like villains in fiction, people like Mercer are open about their intentions. They not only want to make money from manipulating people; they actually believe that most people alive today do not have much value. They claim that it is more important to achieve a utopian world in the far future than to solve today’s pressing problems.

Seriously, many of today’s influential techno-optimists are inspired by views like Transhumanism, Effective Altruism, and Longtermism. Timnit Gebru and Émile Torres coined the abbreviation TESCREAL to summarize these views (and some others as well):

  • Transhumanism
  • Extropianism
  • Singularitarianism
  • Cosmism
  • Rationalism
  • Effective Altruism
  • Longtermism

In these views, achieving artificial general intelligence or establishing colonies on other planets is the ultimate goal of humanity. Achieving these goals is prioritised over immediate problems such as cutting down on carbon emissions to counter climate change. Also, it is perfectly okay to let workers in Kenya label your data for less than 2 Euro a day (one of the “secrets” behind the success of ChatGPT). More on climate disaster and labor exploitation in a minute, but first…

Using language models for IR

… let me go back to my journey with Franciska. In 2001, I defended my PhD thesis supervised by Franciska: “Using Language Models for Information Retrieval”. Today, we may call these statistical language models small language models, as opposed to large language models like GPT. Together with Wessel Kraaij, Arjen de Vries, and Thijs Westerveld, I showed that these models can be implemented using a traditional index. Like the index in the back of a book, such an index lists for each term the pages that contain the term, and it can be used to retrieve documents very efficiently. Web search using small language models therefore takes no more energy than running any other classical search engine.
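To make the idea concrete, here is a minimal sketch of language-model ranking on top of an inverted index. It assumes a query-likelihood model with linear interpolation smoothing, written in a rank-equivalent form in which documents that do not contain a query term contribute nothing and therefore never need to be visited. The function names, the whitespace tokenizer and the smoothing weight `lam=0.15` are illustrative assumptions, not the settings of the thesis.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index from a dict that maps doc_id to text."""
    postings = defaultdict(dict)  # term -> {doc_id: term frequency}
    doc_len = {}                  # doc_id -> number of terms
    coll_tf = Counter()           # term -> frequency in the whole collection
    for doc_id, text in docs.items():
        terms = text.lower().split()
        doc_len[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            postings[term][doc_id] = tf
            coll_tf[term] += tf
    return postings, doc_len, coll_tf


def search(query, postings, doc_len, coll_tf, lam=0.15):
    """Rank documents with a smoothed query-likelihood language model.

    Only the posting lists of the query terms are read. `lam` is the
    (assumed) weight of the document model against the collection model.
    """
    coll_len = sum(doc_len.values())
    scores = defaultdict(float)
    for term in query.lower().split():
        if term not in postings:
            continue
        p_coll = coll_tf[term] / coll_len
        for doc_id, tf in postings[term].items():
            p_doc = tf / doc_len[doc_id]
            # Log-ratio of the smoothed probability to the collection-only
            # probability; it is zero for documents without the term.
            scores[doc_id] += math.log(1 + (lam * p_doc) / ((1 - lam) * p_coll))
    return sorted(scores.items(), key=lambda item: -item[1])


if __name__ == "__main__":
    docs = {
        "d1": "the whale hunts the sea",
        "d2": "the ship sails the sea",
        "d3": "a whale a whale",
    }
    index = build_index(docs)
    for doc_id, score in search("whale", *index):
        print(doc_id, round(score, 3))
```

The cost of a query is proportional to the length of the posting lists it touches, which is why this kind of ranking costs about as much as any other classical index lookup.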

The system inspects the index, and … done! What does search using large language models look like? Let me show the architecture of one of the most popular BERT rerankers of today. BERT is a transformer-based large language model released by Google in 2018.

First, large language models cannot actually retrieve information, so this approach starts with the same index as above. Then it uses BERT, which consists of 12 layers of transformers where each word or word piece is represented by 768 numbers. Then the system needs an actual reranker, another small neural network.

At ACL 2019, Emma Strubell presented an approach to estimate the energy that is needed to train and use BERT, and at SIGIR 2022, Harry Scells used that approach to estimate the energy needed to use BERT as a reranker and compared it to the energy needed by the traditional index. It turns out that using a reranker like this takes a staggering 138,000 times more energy than using the index alone. So, for every query that is processed by the BERT large language model ranker, we can process 138,000 queries using the index!

Let me try to give this presentation a positive ending by explaining how to get the genie back into the bottle. I call on researchers to do the following:

  1. Teach about the dark side of AI and NLP: Big corporations are using this technology to manipulate people on a very large scale;
  2. As a researcher, always try simple baselines: Run and optimize a baseline system that uses the index and nothing more. You may still use small language models;
  3. If you use crowd workers, pay them well;
  4. If you do use LARGE language models: Measure energy consumption and estimate carbon emissions.

Finally, in your conclusion: Make a trade-off. If you improved search quality by 20%, but you increased carbon emissions more than 100,000 times … then maybe conclude it is not worth it!
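Such a trade-off can be put side by side with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and every number in the example (quality scores, energy per query, carbon intensity of the electricity) is a made-up placeholder that a real study would replace with its own measurements. Only the factor of 138,000 comes from the estimate quoted above.

```python
def emissions_kg(energy_kwh, carbon_intensity_kg_per_kwh):
    """Estimate CO2 emissions from measured energy use and grid carbon intensity."""
    return energy_kwh * carbon_intensity_kg_per_kwh


def trade_off(baseline_quality, new_quality, baseline_kwh, new_kwh):
    """Return (relative quality gain, energy increase factor) of a new system."""
    quality_gain = (new_quality - baseline_quality) / baseline_quality
    energy_factor = new_kwh / baseline_kwh
    return quality_gain, energy_factor


if __name__ == "__main__":
    # Placeholder measurements, NOT real results.
    baseline_kwh_per_query = 1e-6                       # index only (hypothetical)
    reranker_kwh_per_query = 138_000 * baseline_kwh_per_query
    gain, factor = trade_off(0.30, 0.36, baseline_kwh_per_query, reranker_kwh_per_query)
    print(f"quality: +{gain:.0%}, energy: x{factor:,.0f}")
    # Hypothetical carbon intensity of 0.4 kg CO2 per kWh, for one million queries.
    print(f"{emissions_kg(reranker_kwh_per_query * 1_000_000, 0.4):.1f} kg CO2")
```

Reporting both numbers together in the conclusion makes it hard to present a modest quality gain without its energy cost.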

WOWS2024: Workshop on Open Web Search

Co-located with ECIR 2024 in Glasgow on 28 March 2024

The First International Workshop on Open Web Search (WOWS) aims to promote and discuss ideas and approaches to open up the web search ecosystem so that small research groups and young startups can leverage the web to foster an open and diverse search market. Therefore, the workshop has two calls that support collaborative and open web search engines:

  1. for scientific contributions, and
  2. for open source implementations

The first call aims for scientific contributions to building collaborative search engines, including collaborative crawling, collaborative search engine deployment, collaborative search engine evaluation, and collaborative use of the web as a resource for researchers and innovators. The second call aims to gather open-source prototypes and gain practical experience with collaborative, cooperative evaluation of search engines and their components using the TIREx Information Retrieval Evaluation Platform.

Important Dates

  • January 24, 2024 (optional): Early Bird Submissions of Software and Papers. You receive early notifications; accepted contributions get a free WOWS T-Shirt
  • February 14, 2024: Deadline Submissions of Software and Papers
  • March 13, 2024: Peer review notification
  • March 20, 2024: Camera-ready papers submission
  • March 28, 2024: Workshop (co-located with ECIR 2024 in Glasgow)

More information at: https://opensearchfoundation.org/wows2024/