Sequential Clustering and Condensing the Meaning of Texts into Centroid Terms
Most traditional clustering algorithms require that the number of clusters be specified beforehand and that all items to be clustered be present when the algorithm is run. These two shortcomings, which are very serious in practical applications, are overcome by a straightforward sequential clustering algorithm. Its most crucial constituent is a distance measure, and the choice of a suitable one is discussed. It is shown how sequentially obtained cluster sets can be improved by reclustering and how items considered outliers can be removed. As a case study, the feasibility of applying the method together with a centroid-based distance measure to find and group semantically similar documents in text analysis is investigated.
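To make the idea of sequential clustering concrete, the following is a minimal sketch and not the algorithm developed in the paper. It assumes a generic threshold-based scheme: each arriving item joins the nearest existing cluster if it lies within a fixed distance of that cluster's centroid, and otherwise opens a new cluster. The class name `SequentialClusterer`, the `threshold` parameter, and the Euclidean distance are illustrative assumptions; reclustering and outlier removal are not covered.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance; stands in for whatever distance measure is chosen."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


class SequentialClusterer:
    """Hypothetical threshold-based sequential clusterer (illustrative only)."""

    def __init__(self, threshold: float,
                 distance: Callable[[Point, Point], float] = euclidean) -> None:
        self.threshold = threshold  # assumed maximum distance to join a cluster
        self.distance = distance
        self.clusters: List[List[Point]] = []
        self.centroids: List[List[float]] = []

    def add(self, item: Point) -> int:
        """Assign one item as it arrives and return the index of its cluster."""
        if self.centroids:
            dists = [self.distance(item, c) for c in self.centroids]
            best = min(range(len(dists)), key=dists.__getitem__)
            if dists[best] <= self.threshold:
                # Close enough: join the nearest cluster and update its centroid.
                self.clusters[best].append(item)
                self.centroids[best] = mean(self.clusters[best])
                return best
        # No cluster is close enough (or none exists yet): open a new one.
        self.clusters.append([item])
        self.centroids.append(list(item))
        return len(self.clusters) - 1


# Items are processed one at a time; the number of clusters is never specified.
clusterer = SequentialClusterer(threshold=1.5)
stream = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0), (0.0, 0.5)]
labels = [clusterer.add(p) for p in stream]
```

In this sketch the number of clusters emerges from the data and the threshold, and items can be added at any time, which corresponds to the two requirements named above. The result depends on the order of arrival, which is one reason a later reclustering step can improve the cluster set.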