Wednesday, 11 April 2012

A new term-weighting scheme for naïve Bayes text categorization

an article by Marcelo Mendoza (Universidad Técnica Federico Santa María, Santiago, Chile) published in International Journal of Web Information Systems Volume 8 Issue 1 (2012)

Abstract

Purpose
Automatic text categorization has applications in several domains, for example e-mail spam detection, sexual content filtering, directory maintenance, and focused crawling. Most information retrieval systems contain several components that use text categorization methods. One of the first text categorization methods was designed using a naïve Bayes representation of the text, and a number of variations of naïve Bayes have since been discussed. The purpose of this paper is to evaluate naïve Bayes approaches to text categorization and to introduce new, competitive extensions to previous approaches.
Design/methodology/approach
The paper introduces a new Bayesian text categorization method based on an extension of the naïve Bayes approach. Some modifications to document representations are introduced, based on the well-known BM25 text information retrieval method. The performance of the method is compared to several extensions of naïve Bayes using benchmark datasets designed for this purpose. The method is also compared to training-based methods such as support vector machines and logistic regression.
Findings
The proposed text categorizer outperforms state-of-the-art methods without introducing new computational costs. It also achieves results very similar to those of more complex methods based on criterion function optimization, such as support vector machines or logistic regression.
Practical implications
The proposed method scales well with the size of the collection involved. The presented results demonstrate the efficiency and effectiveness of the approach.
Originality/value
The paper introduces a novel naïve Bayes text categorization approach based on the well-known BM25 information retrieval model, which offers a set of good properties for this problem.
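
The abstract does not give the paper's actual weighting formula, so the following is only a rough sketch of the general idea: replacing raw term counts in a multinomial naïve Bayes classifier with BM25-style saturated, length-normalized term frequencies. The standard BM25 term-frequency component and its common defaults (k1 = 1.2, b = 0.75) are assumed here, and the names bm25_tf, train, and predict are hypothetical, not taken from the article.

```python
import math
from collections import Counter, defaultdict


def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Standard BM25 term-frequency component (assumed, not from the paper)."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))


def train(docs, labels, k1=1.2, b=0.75):
    """Accumulate BM25-weighted term mass per class instead of raw counts."""
    avg_len = sum(len(d) for d in docs) / len(docs)
    weights = defaultdict(Counter)  # class -> term -> summed BM25 weight
    vocab = set()
    for tokens, label in zip(docs, labels):
        for term, tf in Counter(tokens).items():
            weights[label][term] += bm25_tf(tf, len(tokens), avg_len, k1, b)
            vocab.add(term)
    return {
        "weights": weights,
        "class_counts": Counter(labels),
        "vocab": vocab,
        "avg_len": avg_len,
        "k1": k1,
        "b": b,
    }


def predict(model, tokens):
    """Return the class with the highest log-posterior score."""
    n_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in model["class_counts"].items():
        class_weights = model["weights"][label]
        total = sum(class_weights.values())
        score = math.log(n / n_docs)  # class prior
        for term, tf in Counter(tokens).items():
            if term not in model["vocab"]:
                continue  # ignore terms never seen in training
            w = bm25_tf(tf, len(tokens), model["avg_len"],
                        model["k1"], model["b"])
            # Laplace-smoothed term likelihood over weighted mass
            p = (class_weights[term] + 1) / (total + vocab_size)
            score += w * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented purely for illustration.
docs = [
    ["cheap", "pills", "cheap"],
    ["buy", "cheap", "pills"],
    ["meeting", "agenda", "notes"],
    ["project", "meeting", "tomorrow"],
]
labels = ["spam", "spam", "ham", "ham"]
model = train(docs, labels)
print(predict(model, ["cheap", "pills"]))
```

Because the weighting is computed in a single pass over each document, a scheme of this kind adds essentially no cost over plain naïve Bayes training, which is in line with the abstract's claim about computational cost.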

