Effect of Part-of-Speech and Lemmatization Filtering in Email Classification for Automatic Reply

Rogerio Bonatti, Arthur G. de Paula, Victor S. Lamarca, and Fabio Gagliardi Cozman

Workshop Paper, AAAI '16 Knowledge Extraction from Text Workshop, pp. 496 - 501, February, 2016

Abstract

We study automatic reply to business email messages in Brazilian Portuguese. We present a novel corpus containing messages from a real application, along with baseline categorization experiments using Naive Bayes and Support Vector Machines (SVM). We then discuss the effect of lemmatization and of part-of-speech filtering on precision and recall. SVM classification coupled with non-lemmatized selection of verbs, nouns, adjectives and adverbs was the best approach, with 87.3% maximum accuracy. Straightforward lemmatization in Portuguese led to the lowest classification results in the group, with 85.3% and 81.7% precision for SVM and Naive Bayes respectively. Thus, while lemmatization reduced precision and recall, part-of-speech filtering improved overall results.
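As a rough illustration of the pipeline the abstract describes, the sketch below keeps only verbs, nouns, adjectives and adverbs (without lemmatizing them) and feeds the surviving words to a small multinomial Naive Bayes classifier. It is not the authors' implementation: the tag names, the `filter_by_pos` helper, the `NaiveBayes` class, the category labels and the toy messages are all assumptions made for this example, and the tokens are assumed to have been tagged beforehand by some external part-of-speech tagger. Naive Bayes is used here only because it fits in a few lines of standard-library Python; the abstract reports that SVM performed best.

```python
import math
from collections import Counter, defaultdict

# Hypothetical coarse tag set; the tagger and tag names used in the paper are not given here.
CONTENT_TAGS = {"VERB", "NOUN", "ADJ", "ADV"}


def filter_by_pos(tagged_tokens, keep=CONTENT_TAGS):
    """Keep lowercased surface forms (no lemmatization) whose tag is in `keep`."""
    return [word.lower() for word, tag in tagged_tokens if tag in keep]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in tokens:
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy messages, tagged by hand for illustration only.
train = [
    ([("Quero", "VERB"), ("cancelar", "VERB"), ("meu", "DET"), ("pedido", "NOUN")], "cancelamento"),
    ([("Cancelem", "VERB"), ("a", "DET"), ("compra", "NOUN"), ("agora", "ADV")], "cancelamento"),
    ([("Onde", "ADV"), ("está", "VERB"), ("a", "DET"), ("entrega", "NOUN")], "entrega"),
    ([("A", "DET"), ("entrega", "NOUN"), ("está", "VERB"), ("atrasada", "ADJ")], "entrega"),
]
docs = [filter_by_pos(tagged) for tagged, _ in train]
labels = [label for _, label in train]
model = NaiveBayes().fit(docs, labels)

query = [("Minha", "DET"), ("entrega", "NOUN"), ("não", "ADV"), ("chegou", "VERB")]
prediction = model.predict(filter_by_pos(query))
```

In this toy run the determiner "Minha" is dropped by the filter, and the one remaining word that was seen in training ("entrega") drives the decision. A lemmatized variant would map each word to its lemma inside `filter_by_pos` before counting, which is the step the abstract reports as lowering precision and recall.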

BibTeX

@workshop{Bonatti-2016-124292,
author = {Rogerio Bonatti and Arthur G. de Paula and Victor S. Lamarca and Fabio Gagliardi Cozman},
title = {Effect of Part-of-Speech and Lemmatization Filtering in Email Classification for Automatic Reply},
booktitle = {Proceedings of AAAI '16 Knowledge Extraction from Text Workshop},
year = {2016},
month = {February},
pages = {496 - 501},
}

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.