2017-05-24 20:33:20 - Updated on 2017-05-24 21:07:06

TaXEm - a tool for helping evaluate domain topics

TaXEm tool

The remarkable growth in stored textual information calls for fast and efficient tools to organize, retrieve and browse this information and to extract relevant knowledge from it. One useful way to organize the information of a specific domain is to build topic taxonomies. An important challenge in this research area is evaluating and validating the results. The evaluation can be carried out through objective measures or through subjective analysis based on the judgment of domain specialists. Human evaluation is expensive, however, because it demands knowledge, time and dedication from the specialists. The TaXEm tool aims to reduce the cost of this subjective evaluation. TaXEm (Taxonomia em XML da Embrapa, "Embrapa's Taxonomy in XML") offers functionality for (semi)automatic taxonomy evaluation, which allows the user to run some automatic evaluation before moving on to a subjective one. TaXEm generates the XML file that corresponds to the taxonomy of a specific domain, and it also organizes the texts of this domain. For this reason, only the nodes directly below a node are considered children of that node.


Download Tool: TaXEm tool

Download related papers:

  1. "TaXEm: a tool for aiding the evaluation of domain topic" (Poster) 
  2. "Facilitando a Avaliação de Taxonomias de Tópicos Automaticamente Geradas no Domínio do Agronegócio" (Demo) 

TaXEm tool has been developed and tested only on Windows. You must have Java Runtime Environment (JRE 1.6) installed on your computer.

How to use

    • Configuration:
    • The folder that contains TaXEm.jar must also contain a folder named data with a configuration file named arquivoraiz.txt. Fill in this configuration file in the following order:
      1. The product to be processed in the taxonomy. Its name must be spelled exactly like the product's files folder (produto =).
      2. The folder where the results should be written (dirsaida =).
      3. The folder where the input HTML files are stored. These files contain information about the product being processed and must follow a predetermined format (dirhtml =).
      4. Whether the taxonomy labels must be expanded with the Thesagro vocabulary. To expand the labels with Thesagro, set expandirRotulos = true; otherwise, set expandirRotulos = false.
      5. If you set expandirRotulos = true, the path of the vocabulary to be used for expanding the taxonomy labels (dir_vocabulario =).
      6. The root folder of the taxonomy for the processed product. Use the product name as the key before the folder specification, as shown in Line 6 of the example.

      Example configuration file:

      Line 1:                produto=feijao
      Line 2:                dirsaida=C:\arvoreSite\feijao
      Line 3:                dirhtml=C:\maisdetalhesHTML
      Line 4:                expandirRotulos=true
      Line 5:                dir_vocabulario=C:\arvoreSite\feijao\vocabulario_feijao.all
      Line 6:                feijao=C:\arvoreSite\feijao\arvore\arquivo_raiz.html
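
The configuration format described above can be checked before running the tool. The following is a minimal sketch, not part of TaXEm: the class name ArquivoRaizCheck and its methods are hypothetical, and it assumes that the "Line N:" labels in the example are only annotations (not part of the file), that the file is readable as ISO-8859-1, and that Java 8 or later is available for this helper (newer than the JRE 1.6 the tool itself requires).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of TaXEm): sanity-checks data/arquivoraiz.txt. */
final class ArquivoRaizCheck {

    /** Parses key=value lines, splitting on the first '=' only so Windows paths stay intact. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> settings = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // skip blank lines and lines without '='
            }
            settings.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
        }
        return settings;
    }

    /** Returns the keys that the documented format expects but the settings lack. */
    static List<String> missingKeys(Map<String, String> settings) {
        List<String> missing = new ArrayList<>();
        for (String key : new String[] {"produto", "dirsaida", "dirhtml", "expandirRotulos"}) {
            if (!settings.containsKey(key)) {
                missing.add(key);
            }
        }
        // dir_vocabulario is only required when label expansion is switched on.
        if ("true".equals(settings.get("expandirRotulos")) && !settings.containsKey("dir_vocabulario")) {
            missing.add("dir_vocabulario");
        }
        // Line 6 uses the product name itself as the key (for example feijao=...).
        String produto = settings.get("produto");
        if (produto != null && !settings.containsKey(produto)) {
            missing.add(produto);
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "data/arquivoraiz.txt";
        // Assumption: the encoding is not documented; ISO-8859-1 never fails to decode.
        List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.ISO_8859_1);
        Map<String, String> settings = parse(lines);
        System.out.println("Settings: " + settings);
        System.out.println("Missing keys: " + missingKeys(settings));
    }
}
```

The sketch splits lines by hand instead of using java.util.Properties, because Properties.load treats backslashes as escape characters and would alter paths such as C:\arvoreSite\feijao. How TaXEm itself reads the file is not documented here.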


      • Execution:

      After editing the configuration file, run TaXEm.jar from the DOS prompt with the following command:

      java -jar TaXEm.jar
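
The command above depends on the folder layout described under Configuration. As a rough sketch, a small wrapper can verify that layout and then start the tool. The class TaxemLauncher is hypothetical and not shipped with TaXEm; it assumes that TaXEm looks for the data folder relative to the working directory, that java is on the PATH, and that Java 7 or later runs the wrapper.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper (not part of TaXEm): checks the folder layout, then runs the jar. */
final class TaxemLauncher {

    /** Lists the files the documented layout expects but that are absent from dir. */
    static List<String> missingFiles(File dir) {
        List<String> missing = new ArrayList<>();
        if (!new File(dir, "TaXEm.jar").isFile()) {
            missing.add("TaXEm.jar");
        }
        if (!new File(new File(dir, "data"), "arquivoraiz.txt").isFile()) {
            missing.add("data/arquivoraiz.txt");
        }
        return missing;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        File dir = new File(args.length > 0 ? args[0] : ".");
        List<String> missing = missingFiles(dir);
        if (!missing.isEmpty()) {
            System.err.println("Missing in " + dir.getAbsolutePath() + ": " + missing);
            System.exit(1);
        }
        // Same command as in the text, started with dir as the working directory.
        Process process = new ProcessBuilder("java", "-jar", "TaXEm.jar")
                .directory(dir)
                .inheritIO() // show TaXEm's console output directly
                .start();
        System.exit(process.waitFor());
    }
}
```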

      • Outputs:
      • Using the example above, the folder is arq = C:/arvoreSite/feijao/. All outputs are generated in the folder arq/resultados/:
        1. An XML file that corresponds to the taxonomy. The labels of this taxonomy may or may not be expanded with the synonyms from Thesagro, depending on the user's configuration.
          Example: arq/resultados/feijao_final.xml
        2. A file containing the names of the files in the Conteudo text base.
          Example: arq/resultados/feijao_arqRecursos.xml
        3. The folder DIR = BT_conteudo_final receives the text dataset (arquivos.txt), whose content was extracted from the web page of each taxonomy node. These files are probably NOT cataloged.
          Example: arq/resultados/BT_conteudo_final/
        4. The folder DIR = BT_maisDetalhes holds the text base (arquivos.txt) called maisDetalhes.
          The web page of each taxonomy node probably has a maisDetalhes ("more details") link that leads to another page with information about the node. These pages make up the maisDetalhes files, and they are probably cataloged.
          Example: arq/resultados/BT_maisDetalhes/
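
The output locations listed above can be derived from the configuration and checked after a run. The sketch below is only an illustration: the class TaxemOutputs is hypothetical, and the naming pattern (product name followed by _final.xml and _arqRecursos.xml) is inferred from the feijao examples rather than documented as a rule.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of TaXEm): lists where the outputs are expected. */
final class TaxemOutputs {

    /** Builds the expected output paths under arq/resultados/ for a given product. */
    static Map<String, String> expectedOutputs(String arq, String produto) {
        String base = (arq.endsWith("/") ? arq : arq + "/") + "resultados/";
        Map<String, String> outputs = new LinkedHashMap<>();
        // Assumption: file names follow the pattern seen in the feijao examples.
        outputs.put("taxonomy XML", base + produto + "_final.xml");
        outputs.put("text base file names", base + produto + "_arqRecursos.xml");
        outputs.put("conteudo text base", base + "BT_conteudo_final/");
        outputs.put("maisDetalhes text base", base + "BT_maisDetalhes/");
        return outputs;
    }

    public static void main(String[] args) {
        String arq = args.length > 0 ? args[0] : "C:/arvoreSite/feijao/";
        String produto = args.length > 1 ? args[1] : "feijao";
        for (Map.Entry<String, String> entry : expectedOutputs(arq, produto).entrySet()) {
            boolean exists = new File(entry.getValue()).exists();
            System.out.println(entry.getKey() + ": " + entry.getValue() + (exists ? " (found)" : " (not found)"));
        }
    }
}
```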


Questions, problems, suggestions? Send me an e-mail: merleyc at icmc usp br

This work was conducted with financial support from CNPq and institutional support from ICMC-USP and Embrapa.
Thanks to all the collaborators, especially Maria Fernanda Moura and Solange Oliveira Rezende.

Attention! Original content hosted at: http://sites.labic.icmc.usp.br/merleyc/TaXEm/.