FEATuRE - Features gEnerator based on AssociaTion RulEs


A good textual document representation is essential for Text Mining techniques. The bag-of-words is the usual way to represent text collections. In this representation, each document is represented by a vector in which each word in the document collection corresponds to a dimension (feature). This approach has well-known problems, such as the high dimensionality and sparsity of the data. Moreover, most concepts are expressed by more than one word, such as ``document engineering'' or ``text mining''. The bag-of-related-words was developed to generate features composed of sets of related words, with a dimensionality smaller than that of the bag-of-words. The features are extracted from each textual document of a collection using association rules. Different ways of mapping a document into transactions can be used to allow the extraction of association rules, and different interest measures can be used to prune the number of features. A tool named FEATuRE was developed to generate the bag-of-related-words representation.
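
The idea of extracting features from a single document with association rules can be illustrated with a minimal Python sketch. It is not the FEATuRE implementation: the mapping of each sentence to one transaction, the restriction to word pairs, the use of support and confidence as the only interest measures, the threshold values, and the function names `document_to_transactions` and `related_word_features` are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def document_to_transactions(text):
    """Map one document to transactions (assumption: one word set per sentence)."""
    sentences = re.split(r"[.!?]+", text.lower())
    transactions = [set(re.findall(r"[a-z]+", s)) for s in sentences]
    return [t for t in transactions if t]


def related_word_features(text, min_support=0.5, min_confidence=0.8):
    """Return word-pair features backed by a rule a -> b or b -> a.

    The thresholds are illustrative values, not values taken from the tool.
    """
    transactions = document_to_transactions(text)
    n = len(transactions)
    if n == 0:
        return set()

    # Count in how many transactions each word and each word pair occurs.
    word_count = Counter(w for t in transactions for w in t)
    pair_count = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )

    features = set()
    for (a, b), count in pair_count.items():
        if count / n < min_support:
            continue  # pruned by support
        # Confidence of a -> b is count / word_count[a]; likewise for b -> a.
        confidence = max(count / word_count[a], count / word_count[b])
        if confidence >= min_confidence:
            features.add(f"{a}_{b}")
    return features


if __name__ == "__main__":
    doc = "Text mining needs data. Text mining finds patterns. Data is large."
    print(related_word_features(doc))  # {'mining_text'}
```

In this example only the pair "mining" and "text" co-occurs in at least half of the sentences, so it becomes the single feature for the document; every other pair is pruned by the support threshold.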

