Durham University

Computer Science


Publication details for Professor Alexandra Cristea

Gkotsis, George, Stepanyan, Karen, Cristea, A. I. & Joy, Mike (2014). Entropy-based automated wrapper generation for weblog data extraction. World Wide Web 17(4): 827-846.

This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates a set of relevant approaches based on the use of web feeds and the processing of HTML for the extraction of weblog properties. The approach includes a model for generating a wrapper that exploits web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. The model is evaluated on a collection of weblogs and achieves a prediction accuracy of 89%. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
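The idea of matching feed values against HTML elements across several posts can be illustrated with a minimal sketch. It is not the paper's model: it swaps in a plain frequency count (the share of posts in which a path matched) for the entropy-based, probabilistic scoring, and the names `TextPathCollector` and `derive_rule` are hypothetical. It assumes each post is available as an HTML string paired with the value of one feed field, such as the title.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they are kept off the path stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TextPathCollector(HTMLParser):
    """Collects (path, text) pairs; a path is a chain of 'tag.class' steps."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class") or ""
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.pairs.append((" > ".join(self.stack), text))


def derive_rule(posts):
    """posts: iterable of (html, feed_value) pairs.

    Returns (path, support), where support is the fraction of posts in which
    the element at that path held exactly the feed value, or None if no
    element ever matched. Simple frequency stands in for the paper's scoring.
    """
    counts = Counter()
    total = 0
    for html, feed_value in posts:
        total += 1
        parser = TextPathCollector()
        parser.feed(html)
        parser.close()
        wanted = feed_value.strip()
        # Count each path at most once per post.
        counts.update({path for path, text in parser.pairs if text == wanted})
    if not counts:
        return None
    path, hits = counts.most_common(1)[0]
    return path, hits / total
```

A path that matches the feed value in most posts is a candidate extraction rule, whereas a path that matches only occasionally (for example, a sidebar link that happens to repeat a title) receives low support and is discarded.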