MIREX 0.3 for ClueWeb12

We released a new version, MIREX 0.3, for TREC Web track participants who work on the new ClueWeb12 dataset. The code now uses the new Hadoop API. It was tested on Cloudera's cdh3u5 Hadoop distribution (Hadoop version 0.20.2) and, with some minor tweaks to the build.xml file, also on Cloudera cdh4 versions. Download MIREX at:
http://mirex.sourceforge.net.

Anchor text for ClueWeb12

We are happy to share the anchor text extracted from the TREC ClueWeb12 collection:

  • ClueWeb12_Anchors (30.4 GB; use a BitTorrent client; please seed until you reach a reasonable share ratio)

The data contains anchor text for 0.5 billion pages, about 64% of the total number of pages in ClueWeb12. To keep the file manageable, the anchor text for a page is cut off once more than 10 MB of anchors have been collected for that page. Web pages were truncated at 50 KB before the anchors were extracted. The size is about 30.4 GB (gzipped). The data consists of tab-separated text files with the fields (TREC-ID, URL, ANCHOR TEXT). The anchor text extraction is described in (please cite the report if you use the data in your research):

The source code is available from: http://mirex.sourceforge.net.
(See also Anchor Text for ClueWeb09.)
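
As a minimal sketch of how the tab-separated format described above could be read, the Python snippet below streams a gzipped file and yields one (TREC-ID, URL, ANCHOR TEXT) record at a time. The names `AnchorRecord`, `parse_anchor_line` and `read_anchors` are hypothetical, as is the file path in the usage note. The sketch assumes one record per line and UTF-8 text, neither of which is stated here.

```python
import gzip
from typing import Iterator, NamedTuple


class AnchorRecord(NamedTuple):
    """One record of the anchor text data (field names are illustrative)."""
    trec_id: str
    url: str
    anchor_text: str


def parse_anchor_line(line: str) -> AnchorRecord:
    # Split on the first two tabs only, so that any further tabs
    # remain part of the anchor text field.
    trec_id, url, anchor_text = line.rstrip("\n").split("\t", 2)
    return AnchorRecord(trec_id, url, anchor_text)


def read_anchors(path: str) -> Iterator[AnchorRecord]:
    # Assumption: one record per line, UTF-8 encoded. Undecodable bytes
    # are replaced rather than raising, since the text comes from a web crawl.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # Skip lines that do not have all three fields.
            if line.count("\t") >= 2:
                yield parse_anchor_line(line)
```

With a hypothetical file name, `for record in read_anchors("anchors-part-00000.gz"): ...` processes the data as a stream, which matters at 30.4 GB compressed because nothing is loaded into memory at once.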

Ensemble clustering for result diversification

by Dong Nguyen and Djoerd Hiemstra

This paper describes the participation of the University of Twente in the Web track of TREC 2012. Our baseline approach uses the Mirex toolkit, an open source tool that sequentially scans all the documents. For result diversification, we experimented with improving the quality of clusters through ensemble clustering. We combined clusters obtained by different clustering methods (such as LDA and K-means) and clusters obtained by using different types of data (such as document text and anchor text). Our two-layer ensemble run performed better than the LDA-based diversification and also better than a non-diversification run.

[download pdf]

MIREX in ERCIM News Big Data Special

by Djoerd Hiemstra and Claudia Hauff

MIREX (MapReduce Information Retrieval Experiments) is a software library initially developed by the Database Group of the University of Twente for running large-scale information retrieval experiments on clusters of machines. MIREX has been tested on web crawls of up to half a billion web pages, totalling about 12.5 TB of uncompressed data. MIREX shows that executing test queries by a brute-force linear scan of the pages is a viable alternative to running them on a search engine's inverted index. MIREX is open source and available at SourceForge.

More information in ERCIM News 89.
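
To illustrate the brute-force linear scan idea mentioned above, here is a minimal single-machine sketch in Python. It is not MIREX code: MIREX runs as MapReduce jobs on Hadoop, and the function name `linear_scan`, the whitespace tokenisation and the plain term-frequency score are assumptions chosen only to keep the example short. Every document is read exactly once, scored against all test queries, and only the top k results per query are kept.

```python
import heapq
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def linear_scan(
    queries: Dict[str, List[str]],
    documents: Iterable[Tuple[str, str]],
    k: int = 10,
) -> Dict[str, List[Tuple[int, str]]]:
    """Score every document against every query in one sequential pass.

    queries maps a query id to its (lowercased) terms; documents yields
    (doc_id, text) pairs. No index is built at any point.
    """
    # One bounded min-heap of (score, doc_id) per query.
    top: Dict[str, List[Tuple[int, str]]] = {qid: [] for qid in queries}
    for doc_id, text in documents:
        # Assumption: lowercase whitespace tokenisation.
        term_counts = Counter(text.lower().split())
        for qid, terms in queries.items():
            # Assumption: summed term frequency as a stand-in score.
            score = sum(term_counts[term] for term in terms)
            if score == 0:
                continue
            heap = top[qid]
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            else:
                # Keep only the k best results seen so far.
                heapq.heappushpop(heap, (score, doc_id))
    # Best result first for each query.
    return {qid: sorted(heap, reverse=True) for qid, heap in top.items()}
```

Because the per-document work is independent, a pass like this can in principle be split over many machines, each scanning its own part of the collection, with the per-query top-k lists merged afterwards.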