Publication details for Professor Alexandra Cristea
Gkotsis, George, Stepanyan, Karen, Cristea, A. I. & Joy, Mike (2014). Entropy-based automated wrapper generation for weblog data extraction. World Wide Web 17(4): 827-846.
- Publication type: Journal Article
- ISSN/ISBN: 1386-145X, 1573-1413
- DOI: 10.1007/s11280-013-0269-6
Abstract
This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates a set of relevant approaches based on the use of web feeds and processing of HTML for the extraction of weblog properties. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model, conducted on a collection of weblogs, reports a prediction accuracy of 89%. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
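
The feed-to-HTML matching idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not give the scoring details, so simple frequency voting stands in here for the probabilistic rule derivation. All names (`PathCollector`, `candidate_paths`, `derive_rule`) and the sample posts are hypothetical, and a "rule" is assumed to be nothing more than a tag/class path.

```python
from collections import Counter
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record (path, text) for every non-empty text node in a page."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, e.g. ["html", "body", "h2.title"]
        self.texts = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(("/".join(self.stack), text))


def candidate_paths(html, feed_value):
    """Paths of all elements whose text equals the value taken from the feed."""
    collector = PathCollector()
    collector.feed(html)
    return {path for path, text in collector.texts if text == feed_value.strip()}


def derive_rule(posts):
    """posts: list of (html, feed_value). Returns (path, support) or None.

    Assumption: the path matched in the largest share of posts is the rule.
    """
    votes = Counter()
    for html, value in posts:
        votes.update(candidate_paths(html, value))
    if not votes:
        return None
    path, count = votes.most_common(1)[0]
    return path, count / len(posts)


# Hypothetical sample: three posts from one blog, with titles taken from its feed.
PAGE = ('<html><body><h1 class="site">{site}</h1>'
        '<div class="post"><h2 class="entry-title">{title}</h2><p>Text</p></div>'
        '</body></html>')
posts = [
    (PAGE.format(site="My Blog", title="First post"), "First post"),
    (PAGE.format(site="My Blog", title="Second post"), "Second post"),
    # The feed title also appears in the site header here: an ambiguous match.
    (PAGE.format(site="Third post", title="Third post"), "Third post"),
]
rule = derive_rule(posts)
```

Aggregating matches across several posts, rather than comparing posts pairwise, is what lets the sketch discard the accidental match in the third post: the header path is supported by one post only, while the title path is supported by all three.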