Anchor text for ClueWeb09 Category A

We've made anchor text for the English Category A documents of the TREC ClueWeb09 collection available online via BitTorrent:

The file contains anchor text for about 88% of the pages in Category A. To keep the file manageable, the anchor text for a page is cut off once more than 10 MB of anchors have been collected for that page. The file is about 24.5 GB gzipped. It is a tab-separated text file with the fields (TREC-ID, URL, ANCHOR TEXT). The anchor text extraction is described in the following report (please cite the report if you use the data in your research):

The source code is available from: http://mirex.sourceforge.net
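For readers who only want to consume the data, here is a minimal Python sketch of how the tab-separated file could be streamed without unpacking it first. It is not part of the MIREX code. It assumes one record per line, UTF-8 text, and that any extra tabs belong to the anchor text; the names `AnchorRecord`, `parse_anchor_lines` and `read_anchor_file` are made up for this example.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class AnchorRecord(NamedTuple):
    """One row of the file: (TREC-ID, URL, ANCHOR TEXT)."""
    trec_id: str
    url: str
    anchor_text: str


def parse_anchor_lines(lines: Iterable[str]) -> Iterator[AnchorRecord]:
    """Yield a record for every line that has all three fields."""
    for line in lines:
        # Split on the first two tabs only, so that tabs inside the
        # anchor text (an assumption about the data) stay in the last field.
        fields = line.rstrip("\r\n").split("\t", 2)
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        yield AnchorRecord(*fields)


def read_anchor_file(path: str) -> Iterator[AnchorRecord]:
    """Stream records from the gzipped file without decompressing it to disk."""
    # UTF-8 is an assumption; undecodable bytes are replaced, not fatal.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from parse_anchor_lines(handle)
```

Since the file is about 24.5 GB compressed, iterating over `read_anchor_file("anchors.gz")` (a placeholder file name) one record at a time is likely to be more practical than loading everything into memory.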
