
Zero-cost labelling with web feeds for weblog data extraction
Gkotsis, George, Stepanyan, Karen, Cristea, Alexandra I. and Joy, Mike (2013) Zero-cost labelling with web feeds for weblog data extraction. In: 22nd International World Wide Web Conference (WWW 2013), Rio de Janeiro, Brazil, 13-17 May 2013. Published in: WWW '13 Companion : Proceedings of the 22nd international conference on World Wide Web companion pp. 73-74. ISBN 9781450320382.
Text
WRAP_Gkotsis_gkotsis_stepanyan_cristea_joy_www_2013.pdf (457Kb). Embargoed item: access restricted to Repository staff only.
Abstract
Data extraction from web pages often involves either human intervention for training a wrapper or a reduced level of granularity in the information acquired. Even though the study of social media has drawn the attention of researchers, weblogs remain a part of the web that cannot be harvested efficiently. In this paper, we propose a fully automated approach to generating a wrapper for weblogs, which exploits web feeds for cheap labelling of weblog properties. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. Our evaluation shows that our approach is robust, accurate and efficient in handling different types of weblogs.
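The matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a rule is simply a `(tag, class)` pair, and it stands in for the probabilistic step with a plain hit ratio (the fraction of posts in which a rule selects the value given by the feed). The names `TextByRule`, `learn_rule` and `extract`, and the sample posts, are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not go on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TextByRule(HTMLParser):
    """Collect text chunks keyed by the (tag, class) of the innermost element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements as (tag, class) pairs
        self.texts = {}   # rule -> list of text chunks found under it

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and data.strip():
            self.texts.setdefault(self.stack[-1], []).append(data.strip())


def parse(html):
    parser = TextByRule()
    parser.feed(html)
    parser.close()
    return parser.texts


def learn_rule(posts):
    """posts: list of (html, feed_value). Return the best rule and its hit ratio."""
    hits = Counter()
    for html, feed_value in posts:
        for rule, chunks in parse(html).items():
            if chunks[0] == feed_value.strip():
                hits[rule] += 1   # the feed value acts as a free label
    if not hits:
        return None, 0.0
    rule, count = hits.most_common(1)[0]
    return rule, count / len(posts)


def extract(html, rule):
    """Apply a learned rule to a post that the feed does not cover."""
    chunks = parse(html).get(rule)
    return chunks[0] if chunks else None


# Hypothetical training data: post HTML paired with the title from the feed.
posts = [
    ('<h1 class="entry-title">First post</h1><a class="recent">First post</a>',
     "First post"),
    ('<h1 class="entry-title">Second post</h1><a class="recent">First post</a>',
     "Second post"),
    ('<h1 class="entry-title">Third post</h1><a class="recent">Second post</a>',
     "Third post"),
]
rule, score = learn_rule(posts)
title = extract('<h1 class="entry-title">Fourth post</h1>', rule)
```

In this toy data the sidebar link matches the feed title in only one of the three posts, while the heading matches in all of them, so the heading rule is selected. Scoring each rule across several posts, rather than comparing posts pairwise, is what filters out such coincidental matches.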
Item Type: | Conference Item (Poster)
---|---
Subjects: | Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Divisions: | Faculty of Science, Engineering and Medicine > Science > Computer Science
Library of Congress Subject Headings (LCSH): | Blogs, Data mining
Journal or Publication Title: | WWW '13 Companion : Proceedings of the 22nd international conference on World Wide Web companion
Publisher: | International World Wide Web Conferences Steering Committee
ISBN: | 9781450320382
Official Date: | 13 May 2013
Page Range: | pp. 73-74
Status: | Peer Reviewed
Publication Status: | Published
Access rights to Published version: | Restricted or Subscription Access
Funder: | Seventh Framework Programme (European Commission) (FP7)
Grant number: | 269963 (FP7)
Conference Paper Type: | Poster
Title of Event: | 22nd International World Wide Web Conference (WWW 2013)
Type of Event: | Conference
Location of Event: | Rio de Janeiro, Brazil
Date(s) of Event: | 13-17 May 2013