We released version 0.3 of MIREX for TREC Web track participants who work on the new ClueWeb12 dataset. The code now uses the new Hadoop API. It was tested on Cloudera's cdh3u5 Hadoop distribution (Hadoop version 0.20.2) and, with some minor tweaks to the build.xml file, also on Cloudera cdh4 versions. Download MIREX at:
http://mirex.sourceforge.net.
Anchor text for ClueWeb12
We are happy to share the anchor text extracted from the TREC ClueWeb12 collection:
- ClueWeb12_Anchors (30.4 GB; use a BitTorrent client; please seed until you reach a reasonable share ratio)
The data contains anchor text for 0.5 billion pages, about 64% of the total number of pages in ClueWeb12. To keep the file manageable, the anchor text for a page is cut off once more than 10MB of anchors have been collected for it. Web pages were truncated at 50KB before extracting the anchors. The size is about 30.4 GB (gzipped). The data consists of tab-separated text files with records of the form (TREC-ID, URL, ANCHOR TEXT). The anchor text extraction is described in (please cite the report if you use the data in your research):
- Djoerd Hiemstra and Claudia Hauff. “MIREX: MapReduce Information Retrieval Experiments” CTIT Technical Report TR-CTIT-10-15, Centre for Telematics and Information Technology, University of Twente, ISSN 1381-3625, 2010 (arXiv preprint 1004.4489)
The source code is available from: http://mirex.sourceforge.net.
(See also Anchor Text for ClueWeb09.)
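As an illustration of the record format described above, here is a minimal sketch of how the gzipped, tab-separated files might be read in Python. It assumes one record per line and UTF-8 text, neither of which is stated above; the names `AnchorRecord`, `parse_anchor_lines` and `read_anchor_file` are hypothetical and not part of MIREX.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class AnchorRecord(NamedTuple):
    """One (TREC-ID, URL, ANCHOR TEXT) record."""
    trec_id: str
    url: str
    anchor_text: str


def parse_anchor_lines(lines: Iterable[str]) -> Iterator[AnchorRecord]:
    """Yield records from tab-separated lines, skipping malformed ones.

    Assumption: one record per line. Only the first two tabs are treated
    as separators, so any further tabs stay inside the anchor text.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue  # not a (TREC-ID, URL, ANCHOR TEXT) triple
        yield AnchorRecord(*parts)


def read_anchor_file(path: str) -> Iterator[AnchorRecord]:
    """Stream records from one gzipped file without loading it into memory."""
    # Assumption: UTF-8 encoding; undecodable bytes are replaced.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from parse_anchor_lines(handle)
```

Streaming line by line matters here, because the full data set is about 30.4 GB even when compressed.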
Ensemble clustering for result diversification
by Dong Nguyen and Djoerd Hiemstra
This paper describes the participation of the University of Twente in the Web track of TREC 2012. Our baseline approach uses the Mirex toolkit, an open source tool that sequentially scans all the documents. For result diversification, we experimented with improving the quality of clusters through ensemble clustering. We combined clusters obtained by different clustering methods (such as LDA and K-means) and clusters obtained by using different types of data (such as document text and anchor text). Our two-layer ensemble run performed better than the LDA-based diversification and also better than a non-diversification run.
MIREX in ERCIM News Big Data Special
by Djoerd Hiemstra and Claudia Hauff
MIREX (MapReduce Information Retrieval Experiments) is a software library initially developed by the Database Group of the University of Twente for running large-scale information retrieval experiments on clusters of machines. MIREX has been tested on web crawls of up to half a billion web pages, totalling about 12.5 TB of uncompressed data. MIREX shows that executing test queries by a brute-force linear scan of pages is a viable alternative to running the test queries on a search engine’s inverted index. MIREX is open source and available at SourceForge.
More information in ERCIM News 89.
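The brute-force idea can be sketched in a few lines of Python: every document is read once, scored against all test queries, and only the best few results per query are kept. This is a single-machine illustration, not MIREX's actual code. The term-count scoring and the names `linear_scan`, `queries` and `documents` are assumptions made for the example.

```python
import heapq
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def linear_scan(
    queries: Dict[str, str],
    documents: Iterable[Tuple[str, str]],
    k: int = 10,
) -> Dict[str, List[Tuple[int, str]]]:
    """Score every document against every query in one pass.

    Returns, per query id, up to k (score, doc_id) pairs, best first.
    The score is a plain count of query term occurrences, used here
    only as a stand-in for a real retrieval model.
    """
    parsed = {qid: text.lower().split() for qid, text in queries.items()}
    heaps: Dict[str, List[Tuple[int, str]]] = {qid: [] for qid in queries}

    for doc_id, text in documents:
        counts = Counter(text.lower().split())  # tokenise the page once
        for qid, terms in parsed.items():
            score = sum(counts[term] for term in terms)
            if score == 0:
                continue  # no query term occurs in this page
            heap = heaps[qid]
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            else:
                # Push the new result, then drop the lowest of the k + 1.
                heapq.heappushpop(heap, (score, doc_id))

    return {qid: sorted(heap, reverse=True) for qid, heap in heaps.items()}
```

No index is built at any point: the cost is one pass over the collection, which is the kind of work that can be split over a cluster of machines.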
MIREX: MapReduce IR Experiments
MIREX (MapReduce Information Retrieval Experiments) provides solutions to easily and quickly run large-scale information retrieval experiments on a cluster of machines using Hadoop. Version 0.1 has tools for the TREC ClueWeb09 collection. The code is available to other researchers at: http://mirex.sourceforge.net/.