University of Warwick
Publications service & WRAP

Weakly supervised POS tagging without disambiguation


Zhou, Deyu, Zhang, Zhikai, Zhang, Min-Ling and He, Yulan (2018) Weakly supervised POS tagging without disambiguation. ACM Transactions on Asian and Low-Resource Language Information Processing, 17 (4). 35. doi:10.1145/3214707

PDF: WRAP-weakly-supervised-part-of-speech-tagging-He-2018.pdf - Accepted Version (790Kb)
Official URL: http://dx.doi.org/10.1145/3214707


Abstract

Weakly supervised part-of-speech (POS) tagging aims to learn to predict the POS tag of a given word in context from partially annotated data instead of fully tagged corpora. It would benefit various natural language processing applications in languages for which tagged corpora are mostly unavailable.

In this article, we propose a novel framework for weakly supervised POS tagging based on a dictionary of words with their possible POS tags. In the constrained error-correcting output codes (ECOC)-based approach, a unique L-bit vector is assigned to each POS tag. The set of bit vectors is referred to as a coding matrix with values in {+1, -1}. Each column of the coding matrix specifies a dichotomy over the tag space, for which a binary classifier is learned. The training data for each binary classifier is generated as follows: a word, paired with its set of possible POS tags, is used as a positive training example only if its whole set of possible tags falls into the positive half of the dichotomy specified by the column, and as a negative training example only if the whole set falls into the negative half. Given a word in context, its POS tag is predicted by concatenating the outputs of the L binary classifiers and choosing the tag whose bit vector is closest according to some distance measure. By incorporating the ECOC strategy, the set of all possible tags for each word is treated as a whole, without the need for disambiguation. Moreover, instead of the manual feature engineering employed in most previous POS tagging approaches, features for training and testing in the proposed framework are generated automatically using neural language modeling. The proposed framework has been evaluated on three corpora for English, Italian, and Malagasy POS tagging, achieving accuracies of 93.21%, 90.9%, and 84.5%, respectively, which shows a significant improvement over state-of-the-art approaches.
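
The training-set construction and decoding steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the three-tag set, the hand-written 4-bit coding matrix, the function names, and the use of Hamming distance for decoding are all assumptions made for illustration (the abstract only says "some measure"), and plain strings stand in for the neural language model features.

```python
import numpy as np

# Hypothetical toy tag set and coding matrix (rows: tags, columns: L = 4 dichotomies).
TAGS = ["NOUN", "VERB", "ADJ"]
CODING = np.array([
    [+1, +1, -1, -1],  # NOUN
    [+1, -1, +1, -1],  # VERB
    [-1, +1, +1, +1],  # ADJ
])


def build_binary_training_sets(examples, coding, tags):
    """Build one training set per column of the coding matrix.

    An example is labelled +1 (or -1) for a column only if ALL of its
    candidate tags have +1 (or -1) in that column; otherwise it is skipped,
    so no candidate set ever has to be disambiguated.
    """
    tag_index = {tag: i for i, tag in enumerate(tags)}
    per_column = [[] for _ in range(coding.shape[1])]
    for features, candidates in examples:
        rows = coding[[tag_index[tag] for tag in candidates]]  # (num_candidates, L)
        for j in range(coding.shape[1]):
            column_bits = rows[:, j]
            if np.all(column_bits == 1):
                per_column[j].append((features, +1))
            elif np.all(column_bits == -1):
                per_column[j].append((features, -1))
            # Otherwise the candidate set straddles the dichotomy: skip it.
    return per_column


def decode(outputs, coding, tags):
    """Return the tag whose codeword has the smallest Hamming distance to the outputs."""
    distances = np.sum(coding != np.asarray(outputs), axis=1)
    return tags[int(np.argmin(distances))]


# Usage: "bank" is ambiguous between NOUN and VERB; "green" is only ADJ.
examples = [("bank", ["NOUN", "VERB"]), ("green", ["ADJ"])]
training_sets = build_binary_training_sets(examples, CODING, TAGS)
predicted = decode([-1, +1, +1, -1], CODING, TAGS)  # one flipped bit relative to ADJ
```

In this toy setup, "bank" contributes to the first and fourth binary problems only, because NOUN and VERB disagree on the second and third columns. The decoding call recovers ADJ even though one classifier output is wrong, which is the error-correcting property that gives ECOC its name.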

Item Type: Journal Article
Subjects: P Language and Literature > P Philology. Linguistics
Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Divisions: Faculty of Science > Computer Science
Library of Congress Subject Headings (LCSH): Natural language processing (Computer science), Error-correcting codes (Information theory), Parts of speech, Corpora (Linguistics)
Journal or Publication Title: ACM Transactions on Asian and Low-Resource Language Information Processing
Publisher: ACM
ISSN: 2375-4699
Official Date: August 2018
Dates:
Date | Event
August 2018 | Published
11 March 2018 | Accepted
Volume: 17
Number: 4
Article Number: 35
DOI: 10.1145/3214707
Status: Peer Reviewed
Publisher Statement: © ACM 2018. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Asian and Low-Resource Language Information Processing, http://dx.doi.org/10.1145/3214707
Access rights to Published version: Restricted or Subscription Access
RIOXX Funder/Project Grant:
Project/Grant ID | RIOXX Funder Name | Funder ID
61528302 | [NSFC] National Natural Science Foundation of China | http://dx.doi.org/10.13039/501100001809
BK20161430 | Natural Science Foundation of Jiangsu Province | http://dx.doi.org/10.13039/501100004608
UNSPECIFIED | Collaborative Innovation Centre of Wireless Communications Technology | UNSPECIFIED
