STW grant for StructWeb

Wim Korevaar received a valorization grant from the Dutch Technology Foundation STW for his proposal StructWeb: Structuring the Web for Organizations. The concept is based on an innovative information system, StructWeb, which builds on the latest insights in search technology and uses an intuitive user interface. The new technology will be used to help businesses and organizations structure their vast information resources and make it easier for their staff and clients to access them.

More information at:

Saving the Old IR Literature

The SIGIR project Saving the Old IR Literature has scanned and released a new batch of historic IR (Information Retrieval) papers, including early papers on the SMART system and papers on the development of test collections. The authors include Gerard Salton, Karen Sparck Jones, William Cooper, Keith van Rijsbergen, Stephen Robertson, Martin Kay, Michael Lesk, and Nicholas Belkin. The new batch is listed below and available from the SIGIR web site.

The collection contains some unique documents, for instance Karen Sparck Jones' and Keith van Rijsbergen's Report on the Need for and Provision for an 'IDEAL' Information Retrieval Test Collection, written in 1975, which I anxiously searched for when doing my Ph.D. research. The document is an important milestone on the road to the current TREC conferences, work that started as early as 1960 with Cyril Cleverdon's Cranfield experiments, one of Computer Science's earliest examples of empirical testing in a laboratory setting.

It's all there, enjoy!

Open source alternatives for Blackboard?

Starting in 2009, the University of Twente uses Blackboard as its online learning management system. However, Blackboard turns out to be very insecure; see for instance the news item (in Dutch) Universiteitssoftware blijkt langdurig lek. Among other things, it is not only possible but actually easy for students to hack into a teacher's account and invisibly change grades. As it turns out, this has been known amongst our students for quite some time.

Blackboard is a commercial system and its internals are a company secret. Kerckhoffs's principle states that a secure system must not require secrecy of its design, so that the system can fall into the hands of the enemy without causing trouble. In the design of software systems, this argument is used in favour of open source software security: security through obscurity is considered bad practice; see for instance Jaap-Henk Hoepman and Bart Jacobs' Communications of the ACM article Increased security through open source (CACM 50-1, 2007). So, maybe it is time to look at some of the open source alternatives out there, such as Sakai or Moodle. Both come with commercial support, in case our technical university does not want to invest in the expertise to deploy such a system in-house.

Keith van Rijsbergen retired

Keith van Rijsbergen is retiring this year. To celebrate his long successful career, you can download his book “Information Retrieval” in the popular epub format, an open format that is supported by most e-readers.


Since the publication in 1976 of the first edition of Van Rijsbergen's book, it has established itself as a classic. The book gives a thorough introduction to “automatic ranked” retrieval, which today forms the basis of web search engines, but at that time was still highly experimental. The book covers all important information retrieval topics, but it is Van Rijsbergen's personal view on information retrieval that makes the book so different from other scientific books on the subject. The book is written in the first person, a writing style I would normally not recommend for scientific documents. In this book, however, Van Rijsbergen's personal style of writing inspired me a lot. Maybe it is his undisputed expertise, maybe it is his critical analysis of the work of others, or maybe it is merely his enthusiastic account of science; whatever it is, it is a pleasure to read the book, even almost 35 years after its first publication. Here is a nice example, where Van Rijsbergen shares his view on significance tests:

Unfortunately, I have to agree with the findings of the Comparative Systems Laboratory in 1968, that there are no known statistical tests applicable to IR. This may sound like a counsel of defeat but let me hasten to add that it is possible to select a test which violates only a few of the assumptions it makes.

His analysis led me to use the paired sign test in my PhD thesis, and I motivated this by adding that Van Rijsbergen says I am allowed to do so. (Actually, he claims I am allowed to do so only conservatively, because some of the test's assumptions are not met…) The book is also a no-nonsense book in many respects, with many practical approaches that are directly applicable. In several of our experiments, we used the stop word list printed in the book (see Table 2.1). This is science in its best form. Experiments should be easily reproducible, and what is easier than using an officially published stop word list?
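For readers who have not met the paired sign test, here is a minimal sketch in Python of how it can be applied to per-query scores of two retrieval systems. The function name, the example scores and the choice of a two-sided test are my own assumptions for illustration; this is not code from the thesis or from the book.

```python
from math import comb


def paired_sign_test(scores_a, scores_b):
    """Two-sided paired sign test.

    scores_a and scores_b hold one score per query for systems A and B
    (hypothetical data). Returns (wins_a, wins_b, p_value).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need a score for every query")

    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    n = wins_a + wins_b  # ties carry no sign and are dropped
    if n == 0:
        return wins_a, wins_b, 1.0

    # Under the null hypothesis each non-tied query is a fair coin flip,
    # so the number of wins follows a Binomial(n, 0.5) distribution.
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins_a, wins_b, min(1.0, 2 * tail)


# Made-up average precision values for eight queries.
system_a = [0.5, 0.6, 0.7, 0.8, 0.4, 0.9, 0.3, 0.55]
system_b = [0.4, 0.5, 0.6, 0.7, 0.3, 0.8, 0.2, 0.55]
print(paired_sign_test(system_a, system_b))  # (7, 0, 0.015625)
```

The test only looks at which system wins on each query, not by how much, which is why it makes so few assumptions about the distribution of the scores.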

So, if you are still looking for a good, personal, entertaining, no-nonsense, scientific book on information retrieval to be read by the pool during the holidays, please consider Information Retrieval. No e-reader yet? Then you can read the ebook using the EPUBReader Firefox addon.

[download epub]

Ralf Schimmel graduates on keyword suggestion

Keyword Suggestion for Search Engine Marketing

by Ralf Schimmel

Everyone acquainted with the web is also a frequent user of search engines like Yahoo and Google. Anyone who builds a web site does so with a goal in mind, and most of the time that goal includes being found on the web. Search engines offer web site owners several ways to be found. One group of techniques used in this field is Search Engine Optimization (SEO), which covers everything that can be done to optimize a web site for the search engine. The whole idea of SEO is to ensure that a web site is listed in the set of search results once a matching query is entered by a user. A second important part of search engines is Search Engine Advertisement (SEA). Billions of dollars are paid by companies that bid on keywords that match their advertisements to a user's query. These keywords are hard to find: a company knows what it sells, but it does not know how users search for the same products or services.

Advertising in search engines can be done in multiple ways. The focus of this research lies in finding many long-tail keywords: words that often have a low search volume, but which are cheap (low competition) and often specific enough to ensure high conversion rates (a visitor becomes a customer). Several keyword suggestion techniques are researched and evaluated for practical use. One applicable technique is chosen, implemented and evaluated. The chosen technique is a web-based technique that uses an undirected weighted graph of candidate terms (nodes), where the weight of an edge is the semantic similarity between the two nodes it connects, and where the term frequency of each term is stored in its node. The evaluation shows that the technique is capable of suggesting a lot of relevant keywords that can be used for search engine marketing. According to the evaluation, the technique is capable of using the term frequencies and the semantic similarities to find and rank suggestions based on popularity and relevance.

The most important conclusion is that, for single-term suggestions, the system outperforms Google's suggestion system. Google's precision on single-term suggestions is better than the precision of the new tool; however, the relative recall of Google is a lot worse, for both obvious and non-obvious single-term suggestions. Currently the tool can only be used to complement Google's tool; however, once extended with support for multi-term suggestions, it can replace the entire system.
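To make the data structure in the abstract a bit more concrete, here is a minimal sketch in Python of an undirected weighted term graph with term frequencies stored in the nodes and semantic similarities on the connections between them. The class name, the example terms and numbers, and in particular the scoring formula (similarity times the logarithm of the term frequency) are assumptions made for illustration; the thesis may combine popularity and relevance quite differently.

```python
from math import log


class TermGraph:
    """Undirected weighted graph of candidate terms (hypothetical sketch)."""

    def __init__(self):
        self.term_frequency = {}  # node -> term frequency
        self.similarity = {}      # node -> {neighbour: semantic similarity}

    def add_term(self, term, frequency):
        self.term_frequency[term] = frequency
        self.similarity.setdefault(term, {})

    def add_edge(self, term_a, term_b, weight):
        # The graph is undirected, so the weight is stored in both directions.
        self.similarity.setdefault(term_a, {})[term_b] = weight
        self.similarity.setdefault(term_b, {})[term_a] = weight

    def suggest(self, seed, limit=10):
        """Rank the neighbours of `seed` by an assumed combined score."""
        scored = [
            # Assumed score: relevance (similarity) times damped popularity.
            (weight * log(1 + self.term_frequency.get(term, 0)), term)
            for term, weight in self.similarity.get(seed, {}).items()
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [term for _, term in scored[:limit]]


# Made-up terms, frequencies and similarities.
graph = TermGraph()
for term, frequency in [("bicycle", 500), ("bike", 120), ("cycling", 300), ("helmet", 40)]:
    graph.add_term(term, frequency)
graph.add_edge("bicycle", "bike", 0.9)
graph.add_edge("bicycle", "cycling", 0.6)
graph.add_edge("bicycle", "helmet", 0.3)

print(graph.suggest("bicycle"))  # ['bike', 'cycling', 'helmet']
```

Taking the logarithm of the term frequency is just one way to keep very frequent terms from drowning out closely related but rarer ones, which matters when the goal is to find long-tail keywords.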

[download pdf]

Searching in the free world

Google faced a cyber attack originating from computers in China that was serious enough for the company to send an ultimatum to the Chinese government:

…We have decided we are no longer willing to continue censoring our results on, and so over the next few weeks we will be discussing with the Chinese government the basis on which we could operate an unfiltered search engine within the law, if at all…

See: Google's blog.

Sander Bockting wins ENIAC thesis prize

Sander Bockting has won this year's ENIAC thesis prize. ENIAC is the alumni association for former students of Computer Science, Business Information Technology and Telematics. Each year, ENIAC awards a prize for the best graduation thesis. The jury report reads:

The jury has decided to award the 2009 ENIAC thesis prize to the thesis “Collection Selection for Distributed Web Search: Using Highly Discriminative Keys, Query-driven Indexing and ColRank” by Sander Bockting. The jury chose this thesis because of the relevance of the research, the scientific approach, and the large 'design' component (the prototype Sophos) embodied in the work. In addition, Sander's research offers a (possible) answer to the question of how to keep the internet accessible. Searching the internet, and the search engines that make it possible, fulfil a societal role in opening up information. Because of the strong growth of the internet, however, it is impossible to keep indexing the entire internet centrally. Moreover, that approach gives a lot of power to the owners of a few central search engines. Sander shows that applying distributed search systems is a promising approach that can potentially open up data better while reducing the dependence on a few central search engines. The five techniques he compared are therefore an excellent basis for follow-up research that is relevant to both society and science.

Searching in the 21st Century

Information retrieval (IR) can be defined as the process of representing, managing, searching, retrieving, and presenting information. Good IR involves understanding information needs and interests, developing an effective search technique, system, presentation, distribution and delivery. The increased use of the Web and wider availability of information in this environment led to the development of Web search engines. This change has brought fresh challenges to a wider variety of users’ needs, tasks, and types of information. Today, search engines are seen in enterprises, on laptops, in individual websites, in library catalogues, and elsewhere. Information Retrieval: Searching in the 21st Century focuses on core concepts and current trends in the field. The book covers:

  • Information Retrieval Models
  • User-centred Evaluation of Information Retrieval Systems
  • Multimedia Resource Discovery
  • Image Users’ Needs and Searching Behaviour
  • Web Information Retrieval
  • Mobile Search
  • Context and Information Retrieval
  • Text Categorisation and Genre in Information Retrieval
  • Semantic Search
  • The Role of Natural Language Processing in Information Retrieval: Search for Meaning and Structure
  • Cross-language Information Retrieval
  • Performance Issues in Parallel Computing for Information Retrieval

This book is an invaluable reference for graduate students on IR courses or courses in related disciplines (e.g. computer science, information science, human-computer interaction, and knowledge management), academic and industrial researchers, and industrial personnel tracking information search technology developments to understand the business implications. Intermediate-advanced level undergraduate students on IR or related courses will also find this text insightful. Chapters are supplemented with exercises to stimulate further thinking.

More information at Wiley.

Susan Dumais won the Salton award

Sue Dumais won the Salton award, and gave a terrific keynote talk at the SIGIR Conference in Boston entitled “An Interdisciplinary Perspective on Information Retrieval”. Susan was awarded for “nearly thirty years of significant, sustained, and continuing contributions to research, for exceptional mentorship, and for leadership in bridging the fields of information retrieval and human computer interaction. Her contributions to both the theoretical development and practical implementations of Latent Semantic Indexing, question-answering, desktop search, combining search and navigation, and incorporating the user and their context, have all substantially advanced and enriched the field of Information Retrieval.”

More info at ACM SIGIR.

Opinion mining by Vox-Pop

Vox-Pop provides funny gadgets that show the power of so-called opinion mining or sentiment analysis. The site uses natural language processing tools to find named entities (person names), and to detect whether the individual was mentioned in a positive, negative or neutral way. See below what the “vox populi” think of Dutch football players and trainers. But… why don't they mention Blaise N'Kufo?