WikiLink: Anchor Detection and Link Generation in Wiki’s
by Remko Nolten
In this research we try to automate the process of link generation in wikis by examining existing link generation techniques and enhancing them with our own ideas. We started by analyzing a large document corpus to learn more about the links we want to create. The analysis covered three aspects of our datasets. First, we studied the relation between the text used to display a link and the title of the page the link points to. We showed that a large majority of the links could theoretically be identified by matching the link text with the title of the appropriate page, but we also identified several problems with this approach. Second, we studied the existing link structure in our dataset. Here we confirmed most of the advantages and disadvantages of using existing links in a link generation algorithm that other studies had also identified. Finally, we analyzed the grammatical structure of links to see whether it could be used later on. This analysis showed that a very large majority of the links were nouns or noun phrases, which suggests that grammatical structure is a good way to identify links in a text.
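The first analysis can be illustrated with a minimal sketch that measures how often a link's text matches the title of its target page. The normalization rules, the function names, and the sample links are assumptions made for illustration; they are not taken from the study's datasets or its actual matching procedure.

```python
def normalize(text: str) -> str:
    """Lowercase, turn underscores into spaces, and collapse whitespace."""
    return " ".join(text.replace("_", " ").lower().split())


def title_match_rate(links) -> float:
    """Fraction of (link_text, target_title) pairs whose normalized forms are equal."""
    links = list(links)
    if not links:
        return 0.0
    matches = sum(
        1 for link_text, title in links if normalize(link_text) == normalize(title)
    )
    return matches / len(links)


# Hypothetical sample: two links match their target title, two do not.
sample_links = [
    ("Amsterdam", "Amsterdam"),
    ("information retrieval", "Information_retrieval"),
    ("the capital", "Amsterdam"),                # link text paraphrases the target
    ("Java", "Java (programming language)"),     # title carries a disambiguator
]

rate = title_match_rate(sample_links)
```

The two non-matching pairs hint at the kind of problems a pure title-matching approach runs into: link text that paraphrases the target, and titles that carry a disambiguating suffix.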
Based on the results of this analysis, we built a framework in which we could implement new and existing methods for link generation. In the framework, the process of ‘anchor detection’ (discovering phrases in a larger text that could be used as the basis for a link) and the process of ‘destination finding’ (finding a suitable destination page for a short piece of text) were separated. This allowed us to try multiple combinations and see which worked best. Using this framework, we found that our grammar-based anchor detection algorithm combined with multiple destination finding algorithms resulted in the best performance. Our final performance figures were better than those of most competitors, which shows the potential of our techniques.
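The separation of the two stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the framework built in this research: the capitalized-phrase detector is a simple stand-in for the grammar-based detector (which would need a part-of-speech tagger), and all function names, titles, and aliases are hypothetical.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Each stage is a plain function, so detectors and finders can be mixed freely.
AnchorDetector = Callable[[str], List[str]]
DestinationFinder = Callable[[str], Optional[str]]


def capitalized_phrase_detector(text: str) -> List[str]:
    """Stand-in detector: runs of capitalized words (not the grammar-based method)."""
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)


def make_title_finder(titles: Iterable[str]) -> DestinationFinder:
    """Finder that resolves an anchor by case-insensitive page title match."""
    index = {title.lower(): title for title in titles}
    return lambda anchor: index.get(anchor.lower())


def make_alias_finder(aliases: Dict[str, str]) -> DestinationFinder:
    """Finder that resolves an anchor through a hypothetical alias table."""
    lowered = {alias.lower(): page for alias, page in aliases.items()}
    return lambda anchor: lowered.get(anchor.lower())


def generate_links(
    text: str, detector: AnchorDetector, finders: List[DestinationFinder]
) -> List[Tuple[str, str]]:
    """Detect anchors, then try each finder in order until one returns a page."""
    links = []
    for anchor in detector(text):
        for finder in finders:
            destination = finder(anchor)
            if destination is not None:
                links.append((anchor, destination))
                break
    return links


# Hypothetical data for illustration only.
titles = ["Amsterdam", "Information Retrieval"]
aliases = {"Mokum": "Amsterdam"}
text = "Research on Information Retrieval is done in Mokum and Amsterdam."

links = generate_links(
    text,
    capitalized_phrase_detector,
    [make_title_finder(titles), make_alias_finder(aliases)],
)
```

Because the detector and the finders only share a function signature, swapping in a different anchor detector or adding another destination finder does not require changes to `generate_links`, which is the property that makes it possible to compare combinations.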