Anchor text for ClueWeb12

We are happy to share the anchor text extracted from the TREC ClueWeb12 collection:

  • ClueWeb12_Anchors (30.4 GB; use a BitTorrent client; please seed until you reach a reasonable share ratio)

The data contains anchor text for 0.5 billion pages, about 64% of the total number of pages in ClueWeb12. To keep the files manageable, the anchor text for a page is cut off once more than 10 MB of anchors has been collected for it. Web pages were truncated at 50 KB before the anchors were extracted. The size is about 30.4 GB (gzipped). The data consists of tab-separated text files with three fields per line: (TREC-ID, URL, ANCHOR TEXT). The anchor text extraction is described in (please cite the report if you use the data in your research):

The source code is available from: http://mirex.sourceforge.net.
(See also Anchor Text for ClueWeb09.)
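
For readers who want to process the files, here is a minimal Python sketch of reading the tab-separated format described above. It is not part of the released source code. It assumes one record per line, UTF-8 text, and gzip-compressed files. The names AnchorRecord, parse_line and read_anchors, and the file name in the usage comment, are made up for illustration.

```python
import gzip
from typing import Iterator, NamedTuple, Optional


class AnchorRecord(NamedTuple):
    """One line of the anchor data (field names are illustrative)."""
    trec_id: str
    url: str
    anchor_text: str


def parse_line(line: str) -> Optional[AnchorRecord]:
    """Split one line into (TREC-ID, URL, anchor text).

    Returns None for lines that do not have three tab-separated fields.
    Splitting at most twice keeps any further tabs inside the anchor text.
    """
    parts = line.rstrip("\n").split("\t", 2)
    if len(parts) != 3:
        return None
    return AnchorRecord(*parts)


def read_anchors(path: str) -> Iterator[AnchorRecord]:
    """Stream records from one gzipped file without loading it into memory."""
    # Assumption: the text is UTF-8; undecodable bytes are replaced, not fatal.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            record = parse_line(line)
            if record is not None:
                yield record


# Usage (hypothetical file name):
#   for rec in read_anchors("anchors-part-00000.gz"):
#       print(rec.trec_id, len(rec.anchor_text))
```

Streaming line by line matters here because the anchor text for a single page can be up to about 10 MB and the whole collection is about 30.4 GB compressed.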

One thought on “Anchor text for ClueWeb12”

  1. Djoerd Hiemstra says:

    Dear Janek, Sorry for the late reply, this was sent exactly when I left for holidays. I think it is on-line again, can you please check? Best wishes, Djoerd.
