University of Warwick
Publications service & WRAP

Entropy-based automated wrapper generation for weblog data extraction

Gkotsis, George, Stepanyan, Karen, Cristea, Alexandra I. and Joy, Mike (2014) Entropy-based automated wrapper generation for weblog data extraction. World Wide Web, Volume 17 (Number 4), pp. 827-846. doi:10.1007/s11280-013-0269-6

Accepted Version (PDF, 1640Kb): WRAP_gkotsis_stepanyan_cristea_joy_wwwj_2013_final.pdf
Published Version (PDF, 986Kb): gkotsis_stepanyan_cristea_joy_wwwj_2013.pdf (embargoed; access restricted to repository staff)
Official URL: http://dx.doi.org/10.1007/s11280-013-0269-6

Abstract

This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates a set of relevant approaches based on the use of web feeds and processing of HTML for the extraction of weblog properties. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model on a collection of weblogs reports a prediction accuracy of 89%. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
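
The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the path representation (a tuple of tag/class pairs), the helper names `PathTextCollector` and `derive_rule`, and the sample posts are all hypothetical, and a plain frequency vote stands in for the paper's probabilistic scoring, which this record does not describe. The idea shown is only that a value known from the web feed (here, a post title) is located in the HTML of several posts, and the element path that matches most consistently becomes the extraction rule.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextCollector(HTMLParser):
    """Collect (path, text) pairs; a path is a tuple of (tag, class) steps."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:  # void elements never contain text
            return
        self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self.stack), text))


def derive_rule(posts):
    """posts: list of (html, feed_value). Returns (best_path, support).

    Assumption: support is the fraction of posts in which the path holds
    the feed value; the paper's actual scoring is not reproduced here.
    """
    votes = Counter()
    for html, feed_value in posts:
        collector = PathTextCollector()
        collector.feed(html)
        wanted = feed_value.strip()
        # Count each path at most once per post.
        votes.update({path for path, text in collector.found if text == wanted})
    if not votes:
        return None, 0.0
    path, count = votes.most_common(1)[0]
    return path, count / len(posts)


# Hypothetical sample: two posts from one weblog, with titles taken from its feed.
posts = [
    ('<html><body><h1 class="site">My Blog</h1><div class="post">'
     '<h2 class="entry-title">Hello World</h2><p>Hello World</p></div></body></html>',
     "Hello World"),
    ('<html><body><h1 class="site">My Blog</h1><div class="post">'
     '<h2 class="entry-title">Second Post</h2><p>Some text here.</p></div></body></html>',
     "Second Post"),
]
rule, support = derive_rule(posts)
print(rule[-1], support)
```

In the first sample post the title text also appears in a paragraph, so a single post would leave the rule ambiguous. Voting across multiple posts resolves this in favour of the `h2` path, which is the motivation the abstract gives for matching feed values against several posts.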

Item Type: Journal Article
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Z Bibliography. Library Science. Information Resources > ZA Information resources
Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4050 Electronic information resources
Divisions: Faculty of Science > Computer Science
Library of Congress Subject Headings (LCSH): Web usage mining, Data mining, Web search engines, Blogs, Blogs -- Data processing
Journal or Publication Title: World Wide Web
Publisher: Springer US
ISSN: 1386-145X
Official Date: July 2014
Dates:
Date              Event
July 2014         Published
21 November 2013  Available
4 November 2013   Accepted
31 October 2012   Submitted
Volume: 17
Number: 4
Page Range: 827-846
DOI: 10.1007/s11280-013-0269-6
Status: Peer Reviewed
Publication Status: Published
Access rights to Published version: Restricted or Subscription Access
Funder: Seventh Framework Programme (European Commission) (FP7)
Grant number: 269963 (FP7)
Embodied As: 1
