2017-05-24 21:16:28

FEATuRE - Features gEnerator based on AssociaTion RulEs

About

A good representation of textual documents is essential for Text Mining techniques. The bag-of-words is the most common way to represent text collections. In this representation, each document is a vector in which each word of the document collection corresponds to a dimension (feature). This approach has well-known problems, such as the high dimensionality and sparsity of the data. Moreover, most concepts are composed of more than one word, such as "document engineering" or "text mining". The bag-of-related-words was developed to generate features composed of sets of related words, with a lower dimensionality than the bag-of-words. The features are extracted from each textual document of a collection using association rules. Different ways of mapping a document into transactions can be used to allow the extraction of association rules, and different interest measures can be used to prune the number of features. A tool named FEATuRE was developed to generate the bag-of-related-words representation.
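The idea described above can be illustrated with a minimal Python sketch. It is not the FEATuRE implementation: it assumes each sentence is one transaction, considers only pairs of words, and uses support and confidence as the interest measures. The function names (`sentence_transactions`, `related_word_features`) and the threshold values are hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import combinations


def sentence_transactions(text):
    """Map a document to transactions: one set of lowercase words per sentence.

    Assumption: sentences end with '.', and words are separated by whitespace.
    """
    sentences = [s for s in text.lower().split(".") if s.strip()]
    return [set(s.split()) for s in sentences]


def related_word_features(transactions, min_support=0.5, min_confidence=0.75):
    """Return features built from word pairs supported by an association rule.

    A pair {a, b} becomes a feature when its support reaches min_support and
    the rule a -> b or b -> a reaches min_confidence (illustrative thresholds).
    """
    n = len(transactions)
    word_count = Counter(w for t in transactions for w in t)
    pair_count = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    features = []
    for (a, b), count in sorted(pair_count.items()):
        if count / n < min_support:
            continue  # pruned: the pair is not frequent enough
        # Confidence of a -> b is support(a, b) / support(a); same for b -> a.
        confidence = max(count / word_count[a], count / word_count[b])
        if confidence >= min_confidence:
            features.append(f"{a}_{b}")
    return features


doc = (
    "text mining uses association rules. "
    "text mining needs features. "
    "rules generate features."
)
transactions = sentence_transactions(doc)
features = related_word_features(transactions)
print(features)
```

In this toy document, only the words "mining" and "text" co-occur in enough sentences to survive pruning, so they form a single compound feature instead of two separate bag-of-words dimensions. Other mappings (for example, paragraphs or sliding windows as transactions) and other interest measures would fit the same structure.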

Technical Report Describing the Tool 

Download Tool 

Download Source Code

 


Related research

ROSSI, R. G. ; REZENDE, S. O. Generating Features from Textual Documents Through Association Rules. In ENIA: Encontro Nacional de Inteligência Artificial, 2011, Natal. Encontro Nacional de Inteligência Artificial. Porto Alegre - RS: Sociedade Brasileira de Computação, 2011. v. 1. p. 311-322. 

ROSSI, R. G. ; REZENDE, S. O. Building a Topic Hierarchy Using the Bag-of-Related-Words Representation. In DocEng: 11th ACM Symposium on Document Engineering, 2011, Mountain View, California, USA. Proceedings of the 2011 ACM Symposium on Document Engineering. New York, USA: ACM, 2011. v. 1. p. 195-204. 

NETO, A. T. ; FORTES, R. P. M. ; ROSSI, R. G. ; REZENDE, S. O. MMWA-ae: boosting knowledge from Multimodal Interface Design, Reuse and Usability Evaluation. In SEKE: The 22nd International Conference on Software Engineering and Knowledge Engineering, 2010, Redwood City, San Francisco. SEKE 2010, 2010. v. 1. p. 355-360. 

ROSSI, R. G. ; REZENDE, S. O. The use of frequent itemsets extracted from textual documents for the classification task. In WTI: III International Workshop on Web and Text Intelligence, 2010, São Bernardo do Campo. Joint Conference 2010, 2010. p. 846-855. 

ROSSI, R. G. ; REZENDE, S. O. FEATuRE - Ferramenta para a geração da representação Bag-of-Related-Words [FEATuRE - Tool for generating the Bag-of-Related-Words representation]. Technical Report No. 367. Instituto de Ciências Matemáticas e Computação. 2011.

 


Datasets

Datasets used in the publications: 
ACM-1
ACM-2
ACM-3
ACM-4
ACM-5
ACM-6
ACM-7
ACM-8
Reuters

 


Developed by Rafael Geraldeli Rossi. 

Attention! Original content hosted at: http://sites.labic.icmc.usp.br/feature.