The Open Web Index

Crawling and Indexing the Web for Public Use

by Gijs Hendriksen, Michael Dinzinger, Sheikh Mastura Farzana, Noor
Afshan Fathima, Maik Fröbe, Sebastian Schmidt, Saber Zerhoudi,
Michael Granitzer, Matthias Hagen, Djoerd Hiemstra, Martin
Potthast, and Benno Stein

Only a few search engines index the Web at scale. Third parties who want to develop downstream applications based on web search fully depend on the terms and conditions of the few vendors. The public availability of the large-scale Common Crawl does not alleviate the situation, as it is often cheaper to crawl and index only a smaller collection focused on a downstream application scenario than to build and maintain an index for a general collection the size of the Common Crawl. Our goal is to improve this situation by developing the Open Web Index. The Open Web Index is a publicly funded basic infrastructure from which downstream applications will be able to select and compile custom indexes in a simple and transparent way. Our goal is to establish the Open Web Index along with associated data products as a new open web information intermediary. In this paper, we present our first prototype for the Open Web Index and our plans for future developments. In addition to the conceptual and technical background, we discuss how the information retrieval community can benefit from and contribute to the Open Web Index – for example, by providing resources, by providing pre-processing components and pipelines, or by creating new kinds of vertical search engines and test collections.

To be presented at the European Conference on Information Retrieval (ECIR 2024) in Glasgow on 24-28 March.

[download pdf]

Weighted AUReC

Handling Skew in Shard Map Quality Estimation for Selective Search

by Gijs Hendriksen, Djoerd Hiemstra, and Arjen de Vries

In selective search, a document collection is partitioned into a collection of topical index shards. To efficiently estimate the topical coherence (or quality) of a shard map, the AUReC (Area Under Recall Curve) measure was introduced. AUReC makes the assumption that shards are of similar sizes, one that is violated in practice, even for unsupervised approaches. The problem might be amplified if supervised labelling approaches with skewed class distributions are used. To estimate the quality of such unbalanced shard maps, we introduce a weighted adaptation of the AUReC measure, and empirically evaluate its effectiveness using the ClueWeb09B and Gov2 datasets. We show that it closely matches the evaluations of the original AUReC when shards are similar in size, but captures better the differences in performance when shard sizes are skewed.

To be presented at the European Conference on Information Retrieval (ECIR) in Glasgow on 24-28 March.

[download pdf]

Inaugural lecture on 1 March

On 1 March 2024 at 15:45h., I will give my inaugural lecture: “Zoekmachines: Samen en duurzaam vooruit” (in Dutch). Everyone is invited. Please register on: https://www.ru.nl/rede/hiemstra

In the lecture, I will share an ancient wisdom about working together; I will discuss my plan to teach students of all backgrounds their shared history; and I will reveal my dream to provide unrestricted access to all human information by working together. The lecture will contain cars, Star Trek characters and references to exciting recent research.

Invitation to the inaugural lecture

How NLP became a tool of villains

and how to get the genie back into the bottle

Written for the farewell workshop of Franciska de Jong

Franciska de Jong played a deciding role in my life and career. The first time that she – unknowingly maybe until today – influenced my career was around the year 1993. I was a computer science student at the University of Twente. I was busy living the student life as an active member of the “studentenvakbond SRD” (the union for students), but less active in studying. After four years, I had only managed to pass the courses of year two; I had doubts about computer science, and I thought about quitting. Two courses of Franciska de Jong removed my doubts completely: Computer Science and Language, and Language Technology (in Dutch: Informatica en Taal and Taaltechnologie).

The courses discussed the important transition in the field of Natural Language Processing (NLP) that happened in the early 1990s. Statistical approaches were outperforming symbolic approaches in every application of NLP. I thought this was really exciting. In the courses, we discussed symbolic approaches like those based on Noam Chomsky’s hierarchy of grammars, as well as statistical approaches. For instance, we implemented a simple statistical language model that uses letter tri-grams, that is, combinations of three letters. Such a model may generate text like the following:

But the men.
He diperse pre sidevider the siscrat andly is ord the thabbas.
Whall the of thise his de forter he the that's hisell prouressay by inlipothe ever the feer.

What language is this? Surprisingly, this very simple statistical model of language is able to fool our minds. It produces utter nonsense, but it easily convinces people that the nonsense is English. I reimplemented the assignment for this presentation and trained it on Moby Dick. This is English, right?
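For readers who want to try this themselves, here is a minimal sketch of such a letter tri-gram model in Python. It is not the original course assignment or the reimplementation mentioned above: the function names, the toy training text and the random seed are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def train_trigram_model(text):
    """Count, for every pair of letters, which letter follows it."""
    counts = defaultdict(Counter)
    for i in range(len(text) - 2):
        context, nxt = text[i:i + 2], text[i + 2]
        counts[context][nxt] += 1
    return counts


def generate(counts, seed, length, rng):
    """Extend `seed` (two characters) by sampling one letter at a time."""
    out = seed
    for _ in range(length):
        followers = counts.get(out[-2:])
        if not followers:  # context never seen in training: stop early
            break
        letters = list(followers)
        weights = [followers[c] for c in letters]
        out += rng.choices(letters, weights=weights)[0]
    return out


# Toy training text (an assumption); a whole novel gives far better nonsense.
corpus = "the whale and the white whale went with the wind to the west "
model = train_trigram_model(corpus)
print(generate(model, "th", 60, random.Random(1)))
```

Every three-letter combination in the output has been seen in the training text, which is why the result looks like the training language even though the model knows nothing about words or meaning.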

Also, in the course, we discussed ELIZA, a chatbot developed by Joseph Weizenbaum in the 1960s. ELIZA is inspired by Rogerian psychiatrists: she reflects back the words of the user as a follow-up question. ELIZA is clearly not intelligent, not in any way, but people who chatted with ELIZA were easily manipulated into believing that ELIZA was intelligent. This was to the great frustration of Weizenbaum, who said: “Only people who misunderstood ELIZA called it a sensation.”

YOU: Men are all alike.
ELIZA: Can you think of a specific example ?
YOU: Well, my boyfriend made me come here
ELIZA: Is it important to you that your boyfriend made you come here ?
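The trick behind such a dialogue can be sketched in a few lines of Python. This is not Weizenbaum's original script: the two rules, the pronoun table and the function names below are made up for illustration, and they only cover the exchange above.

```python
import re

# Swap first and second person so the user's words can be echoed back.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are",
               "you": "I", "your": "my"}

# (pattern, response template) pairs; hypothetical rules for illustration.
RULES = [
    (re.compile(r".*\bmy (.+)", re.IGNORECASE),
     "Is it important to you that your {0}?"),
    (re.compile(r".*\ball\b.*", re.IGNORECASE),
     "Can you think of a specific example?"),
]


def reflect(fragment):
    """Replace pronouns word by word, e.g. 'made me' becomes 'made you'."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(w, w) for w in words)


def respond(utterance):
    """Return the response of the first rule whose pattern matches."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            return template.format(*(reflect(g) for g in match.groups()))
    return "Please go on."


print(respond("Men are all alike."))
print(respond("Well, my boyfriend made me come here"))
```

There is no understanding anywhere in this program: it matches a keyword, swaps a few pronouns and fills in a template.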

Now, about 25 years later, we are again witnessing a major transition in the field of Natural Language Processing. Again, these are really exciting times. This time, deep neural approaches are consistently outperforming the statistical models of the 1990s in every conceivable application, including chatbots.

Also, this time, people are easily manipulated into thinking that they talk to an intelligent computer, for instance by chatbots like chatGPT. But chatGPT, like ELIZA, is clearly not intelligent. ChatGPT is based on a model that, like the tri-gram model above, produces the most likely sequence of words. Franciska’s courses were an excellent preparation for calling bullshit on the intelligence of chatGPT.

https://www.nytimes.com/2023/05/01/business/ai-chatbots-hallucination.html

No, chatGPT is not hallucinating: it’s bullshitting.

https://www.cnn.com/2023/10/04/tech/japan-softbank-ai-hnk-intl/

No, we are not anywhere near so-called “Artificial General Intelligence”: bullshit!

https://www.forbes.com/sites/mollybohannon/2023/06/08/lawyer-used-chatgpt-in-court-and-cited-fake-cases-a-judge-is-considering-sanctions/

No, you cannot prepare a court case using chatGPT: What are you DOING?!

Now, you may wonder, if researchers have known for a long time that people are easily manipulated into thinking that machines are intelligent: Why is this happening again, and why on such a large scale? The answer is simple but frightening: Some of us researchers are not very nice people. Some of us researchers are evil. Some of us researchers want to manipulate people. They are like the villains that we know from fiction and cartoons.

https://aclanthology.org/C88-1016.pdf

Here, you see one of the papers that Franciska gave me when I asked for a master’s thesis topic. The topic? Statistical Machine Translation. I loved this topic. I considered machine translation a wonderful application that may bring together people from different cultures and backgrounds. I also considered machine translation to be a knowledge-intensive application, so it is amazing that a statistical system may learn how to translate by just feeding it lots of text.

One of the authors of this paper, however, decided that he would not pursue machine translation further. Instead he went all in on manipulating people. This person was Robert Mercer.

https://www.nytimes.com/2018/04/10/us/politics/mercer-family-cambridge-analytica.html

Robert Mercer was one of the driving forces behind Cambridge Analytica, the company that openly bragged about its ability to influence elections. It cooperated with Facebook to manipulate the Brexit vote, it tried to manipulate the election in the USA in 2016, and it claimed to have manipulated many other elections.

Like villains in fiction, people like Mercer are open about their intentions: They do not only want to make money from manipulating people: They actually believe that most people alive today have not much value. They claim that it is more important to achieve a utopian world in the far future than to solve today’s pressing problems.

Seriously, many of today’s influential techno-optimists are inspired by views like Transhumanism, Effective Altruism, and Longtermism. Timnit Gebru and Émile Torres coined the abbreviation TESCREAL to summarize these views (and some others as well):

  • Transhumanism
  • Extropianism
  • Singularitarianism
  • Cosmism
  • Rationalism
  • Effective Altruism
  • Longtermism

In these views, achieving artificial general intelligence, or achieving colonies on other planets is the ultimate goal of humanity. Achieving these goals is prioritised over immediate problems such as cutting down on carbon emissions to counter climate change. Also, it is perfectly okay to let workers in Kenya label your data for less than 2 Euro a day (one of the “secrets” behind the success of chatGPT). More on climate disaster and labor exploitation in a minute, but first…

Using language models for IR

… let me go back to my journey with Franciska. In 2001, I defended my PhD thesis supervised by Franciska: “Using Language Models for Information Retrieval”. We may call these statistical language models small language models today, as opposed to the large language models like GPT. Together with Wessel Kraaij, Arjen de Vries, and Thijs Westerveld, I showed that these models can be implemented using a traditional index. Like the index in the back of a book, such an index lists for each term the pages that contain the term, and it can be used to retrieve documents very efficiently. Web search using small language models therefore takes no more energy than running any other classical search engine.
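To make the book-index analogy concrete, here is a minimal sketch of such an index in Python. It only does lookups: the language model scoring from the thesis is left out, and the function names and the three toy documents are assumptions for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return the ids of the documents that contain every query term."""
    result = None
    for term in query.lower().split():
        ids = set(index.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])


# Toy collection (an assumption); keys play the role of page numbers.
docs = {1: "open web search", 2: "web index", 3: "open index"}
index = build_index(docs)
print(index["web"])
print(search(index, "open index"))
```

Answering a query touches only the lists of the query terms, never the documents themselves, which is where the efficiency comes from.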

The system inspects the index, and … done! What does search using large language models look like? Let me show the architecture of one of the most popular BERT rerankers of today. BERT is a transformer-based large language model released by Google in 2018.

First, large language models cannot actually retrieve information, so this approach starts with the same index as above. Then it uses BERT, which consists of 12 layers of transformers where each word or word piece is represented by 768 numbers. Then the system needs an actual reranker, another small neural network.

At ACL 2019, Emma Strubell presented an approach to estimate the energy that is needed to train and use BERT, and at SIGIR 2022, Harry Scells used that approach to estimate the energy needed to use BERT as a reranker and compared it to the energy needed by the traditional index. It turns out that using a reranker like this takes a staggering 138,000 times more energy than using the index alone. So, for every query that is processed by the BERT large language model ranker, we can process 138,000 queries using the index!

Let me try to give this presentation a positive ending by explaining how to get the genie back into the bottle. I call on researchers to do the following:

  1. Teach about the dark side of AI and NLP: Big corporations are using this technology to manipulate people on a very large scale;
  2. As a researcher, always try simple baselines: Run and optimize a baseline system that uses the index and nothing more. You may still use small language models;
  3. If you use crowd workers, pay them well;
  4. If you do use LARGE language models: Measure energy consumption and estimate carbon emissions.

Finally, in your conclusion: Make a trade-off. If you improved search quality by 20%, but you increased carbon emissions more than 100,000 times … then maybe conclude it is not worth it!

WOWS2024: Workshop on Open Web Search

Co-located with ECIR 2024 in Glasgow on 28 March 2024

The First International Workshop on Open Web Search (WOWS) aims to promote and discuss ideas and approaches to open up the web search ecosystem so that small research groups and young startups can leverage the web to foster an open and diverse search market. Therefore, the workshop has two calls that support collaborative and open web search engines:

  1. for scientific contributions, and
  2. for open source implementations

The first call aims for scientific contributions to building collaborative search engines, including collaborative crawling, collaborative search engine deployment, collaborative search engine evaluation, and collaborative use of the web as a resource for researchers and innovators. The second call aims to gather open-source prototypes and gain practical experience with collaborative, cooperative evaluation of search engines and their components using the TIREx Information Retrieval Evaluation Platform.

Important Dates

  • January 24, 2024 (optional): Early Bird Submissions of Software and Papers. You receive early notifications; accepted contributions get a free WOWS T-shirt
  • February 14, 2024: Deadline Submissions of Software and Papers
  • March 13, 2024: Peer review notification
  • March 20, 2024: Camera-ready papers submission
  • March 28, 2024: Workshop (co-located with ECIR 2024 in Glasgow)

More information at: https://opensearchfoundation.org/wows2024/

Challenges of index exchange for search engine interoperability

by Djoerd Hiemstra, Gijs Hendriksen, Chris Kamphuis, and Arjen de Vries

We discuss tokenization challenges that arise when sharing inverted file indexes to support interoperability between search engines, in particular: How to tokenize queries such that the tokens are consistent with the tokens in the shared index? We discuss various solutions and present preliminary experimental results that show when the problem occurs and how it can be mitigated by standardizing on a simple, generic tokenizer for all shared indexes.
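The following Python sketch illustrates the kind of mismatch described above. It is not the tokenizer or the experimental setup from the paper: both tokenizers, the example document and the query are assumptions chosen to make the problem visible.

```python
import re


def simple_tokenize(text):
    """A generic tokenizer: lowercase, then split on anything non-alphanumeric."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def whitespace_tokenize(text):
    """A different tokenizer: split on whitespace only, keep case and punctuation."""
    return text.split()


document = "Open-source search at CERN."
index_terms = set(simple_tokenize(document))  # tokens stored in the shared index

query = "open-source CERN"
# Query tokenized differently from the index: no token is found.
print([t for t in whitespace_tokenize(query) if t in index_terms])
# Query tokenized with the same tokenizer as the index: all tokens are found.
print([t for t in simple_tokenize(query) if t in index_terms])
```

A search engine that receives the shared index but applies its own tokenizer to queries can silently miss documents, which is the argument for agreeing on one simple tokenizer.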

To be presented at the 5th International Open Search Symposium #OSSYM2023 at CERN, Geneva, Switzerland on 4-6 October 2023

[download pdf]

Impact and development of an Open Web Index for open web search

by Michael Granitzer, Stefan Voigt, Noor Afshan Fathima, Martin Golasowski, Christian Guetl, Tobias Hecking, Gijs Hendriksen, Djoerd Hiemstra, Jan Martinovič, Jelena Mitrović, Izidor Mlakar, Stavros Moiras, Alexander Nussbaumer, Per Öster, Martin Potthast, Marjana Senčar Srdič, Sharikadze Megi, Kateřina Slaninová, Benno Stein, Arjen P. de Vries, Vít Vondrák, Andreas Wagner, Saber Zerhoudi

Web search is a crucial technology for the digital economy. Dominated by a few gatekeepers focused on commercial success, however, web publishers have to optimize their content for these gatekeepers, resulting in a closed ecosystem of search engines as well as the risk of publishers sacrificing quality. To encourage an open search ecosystem and offer users genuine choice among alternative search engines, we propose the development of an Open Web Index (OWI). We outline six core principles for developing and maintaining an open index, based on open data principles, legal compliance, and collaborative technology development. The combination of an open index with what we call declarative search engines will facilitate the development of vertical search engines and innovative web data products (including, e.g., large language models), enabling a fair and open information space. This framework underpins the EU-funded project OpenWebSearch.EU, marking the first step towards realizing an Open Web Index.

Published by the Journal of the Association for Information Science and Technology (JASIST)

[download pdf]

Artificial intelligence: there are problems we need to address right now, the rest is science fiction

by Frederik Zuiderveen Borgesius, Marvin van Bekkum, and Djoerd Hiemstra

Everywhere you read warnings of ‘existential risks’ from artificial intelligence (AI). Some even warn that AI could wipe out humanity. The tech company OpenAI is predicting the emergence of artificial general intelligence and superintelligence, and of future AI systems that will be more intelligent than humans. Some policymakers also fear this kind of scenario.

But things are not moving that fast. ‘Artificial general intelligence’ means an AI system that, like humans, can perform a variety of different tasks. There is no such general AI at present, and even if it does come one day, creating it will take a very long time.

Many AI systems are useful. Search engines, for example, are indispensable to internet users, and are a good example of specific AI. A specific AI system can perform one task well, such as pointing people to the right website. Modern spam filters, translation software, and speech recognition software also work well thanks to specific AI.

But these are still examples of specific AI – far removed from general AI, let alone ‘superintelligence’. Humans can learn new things. AI systems cannot. What computer scientists are getting better and better at is creating general large language models that can be used for all kinds of specific AI. The same language model can be used for translation software, spam filters, and search engines. Does this mean that such a language model has general intelligence? Could it develop consciousness? Absolutely not! There is therefore no real risk of a science fiction scenario in which an AI system wipes out humanity.

This focus on existential risks distracts us from the real risks at hand, which require our attention right now. Little remains of our privacy, for example. AI systems are trained using data, lots of data. That is why AI developers, mostly big tech companies, are collecting massive amounts of data. For instance, OpenAI presumably gobbled up large sections of the web to develop ChatGPT, including personal data. Incidentally, OpenAI is quite secretive about what data it uses.

Secondly, the use of AI can lead to unfair discrimination. For example, many facial recognition systems do not work well for people with darker skin tones. In the US, the police have repeatedly arrested the wrong person because a facial recognition system wrongly identified the dark-skinned men as criminals.

Thirdly, AI systems consume incredible amounts of electricity. Training and using language models like GPT require a lot of computing power from large data centres, which guzzle energy. Finally, the power of big tech companies is only growing with the use of AI systems. Developing AI systems costs a lot of money, so as the use of AI increases, we become even more dependent on big tech companies. These kinds of risks are already here now. Let’s focus on those, and not let ourselves be distracted by the ghost of sentient AI.

Published by Radboud Recharge.

SIGIR 2023 live at Radboud

On 24, 25 and 26 July we will follow the 46th International ACM SIGIR Conference online from lecture hall 0.28 in the Mercator building. We will start each morning at 8:30h. for the live stream from Taipei, Taiwan and watch recorded sessions and keynotes in the afternoon. There will be presentations from well-known Radboud researchers such as Harrie Oosterhuis, Chris Kamphuis and Negin Ghasemi! 😄

More information at: https://sigir.org/sigir2023/

Follow us on-line: #SIGIR2023