Anchor text for ClueWeb12 – Djoerd Hiemstra

We are happy to share the anchor text extracted from the TREC ClueWeb12 collection:

ClueWeb12_Anchors (30.4 GB; use a BitTorrent client; please seed until you reach a reasonable share ratio)

The data contains anchor text for 0.5 billion pages, about 64% of the total number of pages in ClueWeb12. To keep the files manageable, the anchor text for a page is cut off once more than 10 MB of anchors have been collected for it. Web pages were truncated at 50 KB before the anchors were extracted. The size is about 30.4 GB (gzipped). The data consists of tab-separated text files containing the fields (TREC-ID, URL, ANCHOR TEXT). The anchor text extraction is described in the following report (please cite it if you use the data in your research):

Djoerd Hiemstra and Claudia Hauff. “MIREX: MapReduce Information Retrieval Experiments” CTIT Technical Report TR-CTIT-10-15, Centre for Telematics and Information Technology, University of Twente, ISSN 1381-3625, 2010 (arXiv preprint 1004.4489)

The source code is available from: http://mirex.sourceforge.net.
(See also Anchor Text for ClueWeb09.)
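
As a rough illustration of the tab-separated (TREC-ID, URL, ANCHOR TEXT) format described above, here is a minimal Python sketch for streaming records from one gzipped file. It is not part of MIREX. The names `AnchorRecord`, `parse_anchor_lines` and `read_anchor_file` are made up for this example, and the sketch assumes one record per line, UTF-8 text (undecodable bytes are replaced), and that only the first two tabs on a line separate fields.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class AnchorRecord(NamedTuple):
    """One line of the anchor data (hypothetical helper type)."""
    trec_id: str
    url: str
    anchor_text: str


def parse_anchor_lines(lines: Iterable[str]) -> Iterator[AnchorRecord]:
    """Yield a record for every line that has the three expected fields."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Assumption: split on the first two tabs only, so that any tab
        # inside the anchor text stays part of the third field.
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        yield AnchorRecord(*parts)


def read_anchor_file(path: str) -> Iterator[AnchorRecord]:
    """Stream records from a gzipped file without loading it into memory."""
    # Assumption: the text is UTF-8; bytes that fail to decode are replaced.
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        yield from parse_anchor_lines(handle)
```

Because the files are large, the functions are generators: a loop such as `for record in read_anchor_file("anchors-part.gz"): ...` (the file name is a placeholder) processes one page at a time. Note that a single line may hold up to about 10 MB of anchor text.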
