Effect of Part-of-Speech and Lemmatization Filtering in Email Classification for Automatic Reply
Abstract
We study automatic replies to business email messages in Brazilian Portuguese. We present a novel corpus containing messages from a real application, together with baseline categorization experiments using Naive Bayes and Support Vector Machines (SVM). We then discuss the effect of lemmatization and the role of part-of-speech filtering on precision and recall. SVM classification coupled with non-lemmatized selection of verbs, nouns, adjectives and adverbs was the best approach, with a maximum accuracy of 87.3%. Straightforward lemmatization in Portuguese led to the lowest classification results in the group, with 85.3% and 81.7% precision for SVM and Naive Bayes, respectively. Thus, while lemmatization reduced precision and recall, part-of-speech filtering improved overall results.
BibTeX
@inproceedings{Bonatti-2016-124292,
author = {Rogerio Bonatti and Arthur G. de Paula and Victor S. Lamarca and Fabio Gagliardi Cozman},
title = {Effect of Part-of-Speech and Lemmatization Filtering in Email Classification for Automatic Reply},
booktitle = {Proceedings of AAAI '16 Knowledge Extraction from Text Workshop},
year = {2016},
month = {February},
pages = {496--501},
}