AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS

Authors

  • Alexandre Ribeiro Afonso
  • Cláudio Gottschalg Duque

DOI:

https://doi.org/10.4301/10.4301%252FS1807-17752014000200011

Keywords:

Text Mining, Text Clustering, Natural Language Processing, Brazilian Portuguese, Effectiveness.

Abstract

This article reports the findings of an empirical study of Automated Text Clustering applied to scientific articles and newspaper texts in Brazilian Portuguese. The objective was to find the most effective computational method for clustering the input texts into their original groups. The study comprised four experiments, each with four procedures: 1. Corpus Selection (a set of texts is selected for clustering); 2. Word Class Selection (nouns, verbs and adjectives are extracted from each text by specific algorithms); 3. Filtering Algorithms (a set of terms is selected from the results of the previous stage, a semantic weight is assigned to each term, and an index is generated for each text); 4. Clustering Algorithms (the clustering algorithms Simple K-Means, sIB and EM are applied to the indexes). After these procedures, statistics on clustering correctness and clustering time were collected. The sIB clustering algorithm was the best choice for both the scientific and the newspaper corpus, with the caveat that sIB requires the number of clusters as input before running (newspaper corpus: 68.9% correctness in 1 minute; scientific corpus: 77.8% correctness in 1 minute). The EM clustering algorithm can additionally estimate the number of clusters without user intervention, but its best case is below 53% correctness. In the experiments carried out, the results of automated clustering remained distant from those of human text classification; it was also observed that clustering correctness varies with the number of input texts and their topics.
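The last two procedures (term weighting and clustering) and the correctness measurement can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the texts are already reduced to their selected nouns, verbs and adjectives, that TF-IDF stands in for the "semantic weight" (the abstract does not name the weighting scheme), that a plain k-means with fixed seeds stands in for Simple K-Means, and that correctness is the share of texts that fall in the majority class of their cluster. The toy documents, labels and function names are hypothetical.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def tfidf_index(docs):
    """Build one unit-length TF-IDF vector per tokenized text (assumed weighting)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    index = []
    for doc in docs:
        tf = Counter(doc)
        index.append(normalize(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        ))
    return index


def dot(a, b):
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, seeds, iterations=20):
    """Plain k-means on cosine similarity; `seeds` are indexes of the initial centroids."""
    centroids = [vectors[i] for i in seeds]
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            max(range(k), key=lambda c: dot(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no text changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                total = Counter()
                for v in members:
                    total.update(v)  # adds weights term by term
                centroids[c] = normalize(total)
    return assignment


def correctness(assignment, labels):
    """Share of texts that belong to the majority class of their cluster (assumed measure)."""
    hits = 0
    for c in set(assignment):
        members = [lab for a, lab in zip(assignment, labels) if a == c]
        hits += Counter(members).most_common(1)[0][1]
    return hits / len(labels)


# Hypothetical toy corpus: tokens stand for already selected nouns, verbs and adjectives.
docs = [
    ["futebol", "gol", "time", "jogo"],
    ["time", "gol", "campeonato", "jogo"],
    ["economia", "inflação", "banco", "juros"],
    ["banco", "juros", "mercado", "economia"],
]
labels = ["sport", "sport", "economy", "economy"]

clusters = kmeans(tfidf_index(docs), k=2, seeds=(0, 2))
score = correctness(clusters, labels)
print(clusters, score)
```

Like sIB in the study, this sketch needs the number of clusters (`k`) as input. An algorithm that estimates the number of clusters itself, as EM does in the study, would replace the `kmeans` step while the indexing and correctness steps stay the same.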

Author Biographies

  • Alexandre Ribeiro Afonso
    Alexandre Ribeiro Afonso is a doctoral student in the Information Science program (Faculdade de Ciência da Informação – FCI), University of Brasília (Universidade de Brasília – UnB), Brasília, DF, Brazil. E-mail: rafonso.alex@gmail.com
  • Cláudio Gottschalg Duque
    Cláudio Gottschalg Duque is a professor and researcher in the Information Science program (Faculdade de Ciência da Informação – FCI), University of Brasília (Universidade de Brasília – UnB), Brasília, DF, Brazil. Address: Campus Universitário Darcy Ribeiro, Faculdade de Ciência da Informação, Edifício da Biblioteca Central, Entrada Leste, Brasília, DF, Brazil. CEP: 70.919-970. Phone: 55(61)3107-2632. E-mail: klaussherzog@gmail.com

Published

2014-08-21

Section

Articles

How to Cite

AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS. (2014). Journal of Information Systems and Technology Management, 11(2), 415-436. https://doi.org/10.4301/10.4301%2FS1807-17752014000200011