Word co-occurrence features for text classification

Authors:	Fábio Figueiredo, Leonardo Rocha, Thierson Couto, Thiago Salles, Marcos André Gonçalves, Wagner Meira Jr.

Affiliation:	1. EconoInfo Research, Belo Horizonte, Brazil; 2. Universidade Federal de Minas Gerais, Computer Science Department, Belo Horizonte, Brazil; 3. Universidade Federal de São João Del Rei, Computer Science Department, São João Del Rei, Brazil; 4. Universidade Federal de Goiás, Institute of Informatics, Goiânia, Brazil

Abstract:	In this article we propose a data treatment strategy that generates new discriminative features, called compound-features (or c-features), for text classification. These c-features are composed of terms that co-occur in documents, with no restrictions on the order of or distance between terms within a document. The strategy is applied before the classification task in order to enrich documents with discriminative c-features. The idea is that, when c-features are used in conjunction with single-features, the ambiguity and noise inherent to the bag-of-words representation are reduced. We use c-features composed of two terms in order to keep their usage computationally feasible while still improving classifier effectiveness. We test this approach with several classification algorithms and single-label multi-class text collections. Experimental results demonstrate gains in almost all evaluated scenarios, from the simplest algorithms such as kNN (13% gain in micro-averaged F₁ on the 20 Newsgroups collection) to the most complex one, the state-of-the-art SVM (10% gain in macro-averaged F₁ on the OHSUMED collection).
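
As a rough illustration of the two-term c-features described in the abstract, the following Python sketch enumerates unordered pairs of distinct terms that co-occur in a document and appends them to the document's single-term features. It is only a sketch under stated assumptions: the whitespace tokenizer, the "+" joining convention, and the function names tokenize, compound_features, and augment are hypothetical, and the step that keeps only discriminative c-features, which the abstract mentions but does not specify, is omitted.

```python
from itertools import combinations


def tokenize(text):
    # Assumption: lowercase, whitespace-separated terms. The abstract does
    # not say how documents are tokenized.
    return text.lower().split()


def compound_features(tokens):
    # Enumerate every unordered pair of distinct terms in the document.
    # Sorting the vocabulary gives each pair one canonical name, so neither
    # term order nor distance within the document matters.
    vocab = sorted(set(tokens))
    return [f"{a}+{b}" for a, b in combinations(vocab, 2)]


def augment(text):
    # Keep the original single-term features and append the pair features.
    # Note: no discriminative filtering is applied here (hypothetical
    # simplification); every co-occurring pair is kept.
    tokens = tokenize(text)
    return tokens + compound_features(tokens)
```

Without any filtering, a document with n distinct terms yields n(n-1)/2 pairs, which suggests why restricting c-features to two terms, as the abstract does, helps keep the approach computationally feasible.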

Keywords:	Classification; Text mining; Feature extraction
This article is indexed in ScienceDirect and other databases.
