Efficient Training Data Enrichment and Unknown Token Handling for POS Tagging of Non-standardized Texts

Authors

Melanie Neunerdt, Michael Reyer, Rudolf Mathar

Abstract

In this work, we consider Part-of-Speech tagging of social media text as a fundamental task for Natural Language Processing. We present improvements to a social media Markov model tagger by adapting parameter estimation methods for unknown tokens. In addition, we propose to enrich the social media text corpus by a linear combination with a newspaper training corpus. Applying our tagger to a social media text corpus results in accuracies of around 94.8%, which comes close to accuracies for standardized texts.
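The idea of a linear combination of two training corpora can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the tag-bigram model, the toy sentences, and the weight `lam` are all assumptions made here for illustration, and the abstract does not state how the weights are actually chosen.

```python
from collections import Counter


def tag_bigram_probs(tagged_sentences):
    """Relative-frequency estimates of P(tag | previous tag)."""
    pair_counts = Counter()
    prev_counts = Counter()
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start symbol
        for _, tag in sentence:
            pair_counts[(prev, tag)] += 1
            prev_counts[prev] += 1
            prev = tag
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}


def interpolate(p_social, p_news, lam):
    """Linear combination lam * p_social + (1 - lam) * p_news.

    Events unseen in one corpus contribute probability 0 from that corpus,
    so contexts seen in only one corpus are not renormalized here.
    """
    events = set(p_social) | set(p_news)
    return {
        e: lam * p_social.get(e, 0.0) + (1.0 - lam) * p_news.get(e, 0.0)
        for e in events
    }


# Toy data (assumed): one social media sentence, two newspaper sentences.
social = [[("lol", "ITJ"), ("gut", "ADJD")]]
news = [[("Das", "ART"), ("Haus", "NN")], [("gut", "ADJD")]]

combined = interpolate(tag_bigram_probs(social), tag_bigram_probs(news), lam=0.7)
```

With `lam=0.7`, the social media estimates dominate, while tag transitions seen only in the newspaper data still receive non-zero probability. In practice, a weight like this would be tuned on held-out social media data.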

BibTeX Reference Entry

@inproceedings{NeReMa14b,
	author = {Melanie Neunerdt and Michael Reyer and Rudolf Mathar},
	title = "Efficient Training Data Enrichment and Unknown Token Handling for POS Tagging of Non-standardized Texts",
	pages = "186-192",
	booktitle = "12th Conference on Natural Language Processing (KONVENS)",
	address = {Hildesheim, Germany},
	month = Oct,
	year = 2014,
	hsb = hsb999910363741 ,
	}

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.


Institute for Theoretical Information Technology
