Enhanced Web Page Cleaning for Constructing Social Media Text Corpora

Authors

M. Neunerdt, E. Reimer, M. Reyer, R. Mathar

Abstract

Web page cleaning is one of the most essential tasks in Web corpus construction. The intention is to separate the main content from navigational elements, templates, and advertisements, often referred to as boilerplate. In this paper, we enhance Web page cleaning for pages containing comments and introduce a new training corpus for that purpose. Besides extending an existing boilerplate detection algorithm by means of a comment classifier, we train and test different classifiers on extended feature sets, solving a two-class problem (content vs. boilerplate) on our corpus and an existing benchmark corpus. Results show that the proposed approach outperforms existing methods, particularly on comment pages from different domains. Finally, we point out that our trained classifiers are domain independent and can be transferred to other languages with only small adjustments.

BibTeX Reference Entry

@inproceedings{NeReReMa15,
	author = {Melanie Neunerdt and Eva Reimer and Michael Reyer and Rudolf Mathar},
	title = "Enhanced Web Page Cleaning for Constructing Social Media Text Corpora",
	pages = "1-8",
	booktitle = "6th International Conference on Information Science and Applications (ICISA)",
	address = {Pattaya, Thailand},
	doi = {10.1007/978-3-662-46578-3{\_}78},
	month = Feb,
	year = 2015,
	hsb = {RWTH-2015-01482},
	}
