Efficient Training Data Enrichment and Unknown Token Handling for POS Tagging of Non-standardized Texts

Authors

M. Neunerdt, M. Reyer, R. Mathar

Abstract

In this work, we consider Part-of-Speech (POS) tagging of social media texts as a fundamental task for Natural Language Processing. We present improvements to a social media Markov model tagger by adapting parameter estimation methods for unknown tokens. In addition, we propose to enrich the social media text corpus by a linear combination with a newspaper training corpus. Applying our tagger to a social media text corpus results in accuracies of around 94.8%, which comes close to accuracies for standardized texts.
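
To illustrate the idea of a linear combination of two training corpora, the following minimal sketch weights (token, tag) counts from a small social media corpus against those from a newspaper corpus before estimating emission probabilities. It is an assumption about how such a combination could look, not the estimation procedure from the paper: the function names, the `weight` parameter, and the toy data are all hypothetical.

```python
from collections import Counter


def weighted_counts(social, news, weight):
    """Linearly combine (token, tag) counts from two corpora.

    Social media pairs count `weight`, newspaper pairs count `1 - weight`.
    `weight` is a hypothetical mixing parameter in [0, 1].
    """
    counts = Counter()
    for pair in social:
        counts[pair] += weight
    for pair in news:
        counts[pair] += 1.0 - weight
    return counts


def emission_probs(counts):
    """Relative-frequency estimate of P(token | tag) from weighted counts."""
    positive = {pair: c for pair, c in counts.items() if c > 0}
    tag_totals = Counter()
    for (_, tag), c in positive.items():
        tag_totals[tag] += c
    return {(tok, tag): c / tag_totals[tag] for (tok, tag), c in positive.items()}


# Toy corpora (invented for illustration only).
social = [("lol", "ITJ"), ("haus", "NN"), ("auto", "NN")]
news = [("haus", "NN"), ("auto", "NN"), ("auto", "NN"), ("zeitung", "NN")]

probs = emission_probs(weighted_counts(social, news, weight=0.5))
for (token, tag), p in sorted(probs.items()):
    print(f"P({token} | {tag}) = {p:.3f}")
```

Combining counts before normalizing keeps each per-tag distribution summing to one, even when a tag occurs in only one of the two corpora. How the weight would be chosen, and how unknown tokens are treated, is left open here.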

BibTeX Reference Entry

@inproceedings{NeReMa14b,
	author = {Melanie Neunerdt and Michael Reyer and Rudolf Mathar},
	title = "Efficient Training Data Enrichment and Unknown Token Handling for POS Tagging of Non-standardized Texts",
	pages = "186-192",
	booktitle = "12th Conference on Natural Language Processing (KONVENS)",
	address = {Hildesheim, Germany},
	month = Oct,
	year = 2014,
	hsb = hsb999910363741 ,
	}

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.