Authors
Bin Tang, Xiao Luo, Malcolm I Heywood, Michael Shepherd
Publication date
2004/12/6
Journal
Technical Report CS-2004-14
Description
Dimension reduction techniques (DRTs) are applicable to a wide range of information systems, and the application context naturally has a significant impact on which DRT is appropriate. This research presents a systematic study of four DRTs for the text clustering problem using five benchmark datasets. Of the four methods--Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), Document Frequency (DF) and Random Projection (RP)--ICA and LSI are clearly superior when the k-means clustering algorithm is applied, irrespective of the dataset. Random projection consistently returns the worst results, which appears to be due to the noise distribution characterizing the document clustering task.
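As a rough illustration of the two simplest methods named above, the following minimal sketch applies Document Frequency term selection and a Gaussian random projection to a toy corpus. It is not the report's implementation: the corpus, the function names (`df_select`, `term_vectors`, `random_project`) and the parameter values are all hypothetical, and LSI, ICA and the k-means step are left out because they would need a linear algebra library.

```python
import math
import random
from collections import Counter

def df_select(docs, k):
    """Keep the k terms that occur in the most documents (Document Frequency)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each term once per document
    # Sort by descending document frequency, then alphabetically for determinism.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return ranked[:k]

def term_vectors(docs, vocab):
    """Represent each document as raw term counts over the chosen vocabulary."""
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for tok in doc.split():
            if tok in index:
                row[index[tok]] += 1.0
        rows.append(row)
    return rows

def random_project(rows, out_dim, seed=0):
    """Project rows onto out_dim dimensions with a Gaussian random matrix."""
    rng = random.Random(seed)
    in_dim = len(rows[0])
    scale = 1.0 / math.sqrt(out_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(out_dim)]
              for _ in range(in_dim)]
    return [[sum(row[i] * matrix[i][j] for i in range(in_dim))
             for j in range(out_dim)]
            for row in rows]

# Hypothetical toy corpus: two finance documents, two sports documents.
docs = [
    "stock market trading stock",
    "market prices fall",
    "football match result",
    "football team wins match",
]

vocab = df_select(docs, k=3)               # DF: keep the 3 most widespread terms
reduced_df = term_vectors(docs, vocab)     # 4 documents x 3 selected terms

full_vocab = df_select(docs, k=100)        # k exceeds vocabulary size: keep all terms
reduced_rp = random_project(term_vectors(docs, full_vocab), out_dim=2)
```

Either reduced matrix could then be passed to a clustering routine. DF keeps a subset of the original, interpretable term axes, whereas random projection mixes all terms into new axes.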
Total citations
[Chart: citations per year, 2004-2017; per-year counts not recoverable]
Scholar articles
B Tang, X Luo, MI Heywood, M Shepherd - Technical Report CS-2004-14, 2004