
Entropy-based automated wrapper generation for weblog data extraction
Gkotsis, George, Stepanyan, Karen, Cristea, Alexandra I. and Joy, Mike (2014) Entropy-based automated wrapper generation for weblog data extraction. World Wide Web, Volume 17 (Number 4), pp. 827-846. doi:10.1007/s11280-013-0269-6. ISSN 1386-145X.
Official URL: http://dx.doi.org/10.1007/s11280-013-0269-6
Abstract
This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates a set of relevant approaches based on the use of web feeds and processing of HTML for the extraction of weblog properties. The approach includes a model for generating a wrapper that exploits web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model on a collection of weblogs reports a prediction accuracy of 89%. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
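
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the names `TextByElement`, `candidate_rules` and `best_rule` are hypothetical, a "rule" is assumed to be just a `(tag, class)` pair, and a simple support count across posts stands in for the paper's probabilistic scoring, which the abstract does not specify.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they carry no text of their own.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TextByElement(HTMLParser):
    """Collect (tag, class, text) for every closed element in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []     # open elements as [tag, class, text parts]
        self.elements = []  # finished elements as (tag, class, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.stack.append([tag, dict(attrs).get("class") or "", []])

    def handle_data(self, data):
        # Text counts towards every element that is currently open.
        for entry in self.stack:
            entry[2].append(data)

    def handle_endtag(self, tag):
        if tag not in [entry[0] for entry in self.stack]:
            return  # stray closing tag: ignore it
        while True:
            t, cls, parts = self.stack.pop()
            text = " ".join("".join(parts).split())  # normalise whitespace
            self.elements.append((t, cls, text))
            if t == tag:
                break


def candidate_rules(html, feed_value):
    """Return the (tag, class) pairs whose text equals the feed value."""
    parser = TextByElement()
    parser.feed(html)
    parser.close()
    target = " ".join(feed_value.split())
    return {(tag, cls) for tag, cls, text in parser.elements if text == target}


def best_rule(posts):
    """posts: list of (html, feed_value) pairs for one feed property.

    Returns (rule, support), where support is the fraction of posts in
    which the rule located the feed value, or None if nothing matched.
    """
    counts = Counter()
    for html, feed_value in posts:
        counts.update(candidate_rules(html, feed_value))
    if not counts:
        return None
    rule, hits = counts.most_common(1)[0]
    return rule, hits / len(posts)
```

Calling `best_rule` with the HTML of several posts and, for example, each post's title from the feed returns the candidate rule that matched in the most posts. Aggregating over many posts, rather than comparing posts pairwise, is what lets a rule that matches only by coincidence in one post lose out to one that matches consistently.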
Item Type: | Journal Article
---|---
Subjects: | Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software; Z Bibliography. Library Science. Information Resources > ZA Information resources; Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4050 Electronic information resources
Divisions: | Faculty of Science, Engineering and Medicine > Science > Computer Science
Library of Congress Subject Headings (LCSH): | Web usage mining, Data mining, Web search engines, Blogs, Blogs -- Data processing
Journal or Publication Title: | World Wide Web
Publisher: | Springer US
ISSN: | 1386-145X
Official Date: | July 2014
Volume: | Volume 17
Number: | Number 4
Page Range: | 827-846
DOI: | 10.1007/s11280-013-0269-6
Status: | Peer Reviewed
Publication Status: | Published
Access rights to Published version: | Restricted or Subscription Access
Date of first compliant deposit: | 27 December 2015
Date of first compliant Open Access: | 27 December 2015
Funder: | Seventh Framework Programme (European Commission) (FP7)
Grant number: | 269963 (FP7)
Embodied As: | 1