We've put anchor text for the English Category A documents of the TREC ClueWeb09 collection online using BitTorrent:
- ClueWeb09_Anchors (24.5 GB; please seed until you reach a reasonable share ratio)
The file contains anchor text for about 88% of the pages in Category A. To keep the file manageable, the anchor text for a page is cut off once more than 10 MB of anchors have been collected for that page. The gzipped file is about 24.5 GB. It is a tab-separated text file with the fields (TREC-ID, URL, ANCHOR TEXT). The anchor text extraction is described in the following report (please cite it if you use the data in your research):
- Djoerd Hiemstra and Claudia Hauff. “MIREX: MapReduce Information Retrieval Experiments”, CTIT Technical Report TR-CTIT-10-15, Centre for Telematics and Information Technology, University of Twente, ISSN 1381-3625, 2010 (arXiv preprint 1004.4489).
The source code is available from: http://mirex.sourceforge.net
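The gzipped, tab-separated file can be streamed without unpacking it first. The following is a minimal Python sketch, not part of MIREX. It assumes one record per line, UTF-8 text, and that only the first two tabs separate fields, so any further tabs stay in the anchor text. The function names and the file name in the usage note are hypothetical.

```python
import gzip
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (TREC-ID, URL, anchor text)


def parse_anchor_lines(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (trec_id, url, anchor_text) for each well-formed line.

    Assumption: one record per line, split on the first two tabs only.
    Lines with fewer than three fields are skipped.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t", 2)
        if len(fields) != 3:
            continue
        yield fields[0], fields[1], fields[2]


def read_anchor_file(path: str) -> Iterator[Record]:
    """Stream records from the gzipped file without loading it into memory.

    Assumption: the text is UTF-8; undecodable bytes are replaced.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from parse_anchor_lines(handle)
```

For example, `for trec_id, url, text in read_anchor_file("ClueWeb09_Anchors.gz"):` would iterate over the records one at a time, where the file name stands in for wherever the download was saved.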
The Category B anchors mentioned in the comments above are no longer available.