Updating broken web links: An automatic recommendation system

An article by Juan Martinez-Romo and Lourdes Araujo (Dpto. Lenguajes y Sistemas Informáticos, NLP & IR Group, UNED, Madrid), published in Information Processing & Management, Volume 48, Issue 2 (March 2012)


Broken hypertext links are a frequent problem on the Web.
Sometimes the page which a link points to has disappeared forever, but in many other cases the page has simply been moved to another location in the same web site or to another one. In some cases the page, besides being moved, has been updated, so that it differs somewhat from the original but remains similar to it.
In all these cases it can be very useful to have a tool that provides us with pages highly related to the broken link, since we could select the most appropriate one. The relationship between the broken link and its candidate replacement pages can be defined as a function of many factors.

In this work we have employed several resources, both in the context of the link and on the Web, to look for pages related to a broken link. Within the context of a link, we have analyzed several sources of information: the anchor text, the text surrounding the anchor, the URL and the page containing the link.
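As a rough illustration of two of these context sources, the sketch below pulls the anchor text and the URL terms for a given link out of the HTML of the page that contains it. It is only a minimal example built on the Python standard library; the names `AnchorCollector`, `url_terms` and `link_context` are hypothetical and are not taken from the article, which does not describe its implementation here.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Collects the anchor text of every <a> whose href equals target_url.

    Hypothetical helper for illustration; not the authors' implementation.
    """

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self._inside = False      # True while parsing a matching <a> element
        self._parts = []          # text fragments of the current anchor
        self.anchor_texts = []    # one entry per matching link

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self._inside = True
            self._parts = []

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            # Collapse runs of whitespace inside the anchor text.
            self.anchor_texts.append(" ".join("".join(self._parts).split()))
            self._inside = False


def url_terms(url):
    """Splits the host and path of a URL into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    raw = re.split(r"[^A-Za-z0-9]+", parsed.netloc + parsed.path)
    return [token.lower() for token in raw if token]


def link_context(html, target_url):
    """Returns the anchor texts and URL terms for target_url within html."""
    collector = AnchorCollector(target_url)
    collector.feed(html)
    collector.close()
    return {
        "anchor_texts": collector.anchor_texts,
        "url_terms": url_terms(target_url),
    }
```

Terms gathered this way could, for instance, be used to build search-engine queries; the text surrounding the anchor and the containing page would need additional handling that this sketch leaves out.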

We have also extracted information about a link from the Web infrastructure such as search engines, Internet archives and social tagging systems.

We have combined all of these resources to design a system that recommends pages that can be used to recover the broken link.

A novel methodology is presented to evaluate the system without resorting to user judgments, thus increasing the objectivity of the results and helping to adjust the parameters of the algorithm. We have also compiled a web page collection with true broken links, which human evaluators have used to test the full system.

Results show that the system is able to recommend the correct page among the first ten results when the page has been moved, and to recommend highly related pages when the original one has disappeared.
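The abstract does not spell out how "among the first ten results" is measured. One simple way to quantify such a claim, shown here purely as an assumption, is the fraction of links whose known correct page appears in the top k recommendations. The function name `hit_rate_at_k` and the data layout are hypothetical.

```python
def hit_rate_at_k(ranked_results, correct_pages, k=10):
    """Fraction of links whose correct page is within the first k recommendations.

    ranked_results: dict mapping a link to its ranked list of recommended URLs.
    correct_pages:  dict mapping a link to the URL known to be its correct target.
    Both structures are assumptions made for this illustration.
    """
    if not correct_pages:
        return 0.0
    hits = sum(
        1
        for link, correct in correct_pages.items()
        if correct in ranked_results.get(link, [])[:k]
    )
    return hits / len(correct_pages)
```

For example, with two links of which only one has its correct page in the ranked list, the function returns 0.5.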

