University of Warwick
Publications service & WRAP

Self-supervised automated wrapper generation for weblog data extraction

Gkotsis, George, Stepanyan, Karen, Cristea, Alexandra I. and Joy, Mike (2013) Self-supervised automated wrapper generation for weblog data extraction. In: 29th British National Conference on Databases, BNCOD 2013, Oxford, 8-10 July, 2013. Published in: Big data, Volume 7968 pp. 292-302. ISBN 9783642394669. doi:10.1007/978-3-642-39467-6_26 ISSN 0302-9743.

WRAP_Stepanyan_gkotsis_stepanyan_cristea_joy_bncod13.pdf - Accepted Version

Official URL: http://dx.doi.org/10.1007/978-3-642-39467-6_26

Abstract

Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.
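
The matching step described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names, the exact-string match, and the choice of a (tag, class) pair as the "rule" are all assumptions made for illustration, and the simple support fraction stands in for the paper's probabilistic scoring, which the abstract does not detail. Each post contributes the HTML elements whose text equals the value taken from the web feed, and the rule that matches in the largest share of posts is kept.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements without closing tags; skipped so they never own text (simplification).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ElementTextCollector(HTMLParser):
    """Records (tag, class, text) for each piece of text and its enclosing element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements as (tag, class attribute)
        self.found = []   # (tag, class attribute, stripped text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerates unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            tag, cls = self.stack[-1]
            self.found.append((tag, cls, text))


def candidate_rules(html, feed_value):
    """Return the set of (tag, class) pairs whose text equals the feed value."""
    parser = ElementTextCollector()
    parser.feed(html)
    parser.close()
    return {(tag, cls) for tag, cls, text in parser.found if text == feed_value}


def derive_rule(posts):
    """posts: list of (html, feed_value). Returns (best rule, share of posts matched)."""
    counts = Counter()
    for html, value in posts:
        counts.update(candidate_rules(html, value))
    if not counts:
        return None, 0.0
    rule, hits = counts.most_common(1)[0]
    return rule, hits / len(posts)


# Hypothetical posts: the feed supplies the title, the HTML is the post page.
posts = [
    ('<body><ul><li class="recent">First post</li></ul>'
     '<h1 class="entry-title">First post</h1><p>Body</p></body>', "First post"),
    ('<body><h1 class="entry-title">Second post</h1></body>', "Second post"),
    ('<body><h1 class="entry-title">Third post</h1><p>Third post</p></body>',
     "Third post"),
]

rule, support = derive_rule(posts)
print(rule, support)  # ('h1', 'entry-title') 1.0
```

Aggregating over several posts is what removes the ambiguity in this toy data: the sidebar link and the stray paragraph each match the title in one post only, while the heading element matches in all three.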

Item Type: Conference Item (Paper)
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Divisions: Faculty of Science, Engineering and Medicine > Science > Computer Science
Library of Congress Subject Headings (LCSH): Blogs, Data mining
Series Name: Lecture Notes in Computer Science
Journal or Publication Title: Big data
Publisher: Springer Berlin Heidelberg
ISBN: 9783642394669
ISSN: 0302-9743
Editor: Gottlob, G. (Georg) and Grasso, Giovanni and Olteanu, Dan and Schallhart, Christian
Official Date: 2013
Dates: 2013 (Published)
Volume: 7968
Page Range: pp. 292-302
DOI: 10.1007/978-3-642-39467-6_26
Status: Peer Reviewed
Publication Status: Published
Access rights to Published version: Restricted or Subscription Access
Date of first compliant deposit: 26 December 2015
Date of first compliant Open Access: 26 December 2015
Funder: Seventh Framework Programme (European Commission) (FP7)
Grant number: 269963 (FP7)
Conference Paper Type: Paper
Title of Event: 29th British National Conference on Databases, BNCOD 2013
Type of Event: Conference
Location of Event: Oxford
Date(s) of Event: 8-10 July, 2013
