Imbalanced k-means: An algorithm to cluster imbalanced-distributed data
CNS Kumar, KN Rao, A Govardhan… - International Journal of …, 2014 - academia.edu
International Journal of Engineering and Technical Research, 2014•academia.edu
Abstract
K-means is a well-known partitional clustering technique, widely used for its low computational cost. However, the performance of the k-means algorithm tends to be affected by skewed data distributions, i.e., imbalanced data. It often produces clusters of relatively uniform sizes, even if the input data have varied cluster sizes, which is called the “uniform effect.” In this paper, we analyze the causes of this effect and illustrate that it probably occurs more in the k-means clustering process. As the minority class decreases in size, the “uniform effect” becomes evident. To prevent the “uniform effect”, we revisit the well-known K-means algorithm and provide a general method to properly cluster imbalanced-distributed data. We present Imbalanced K-Means (IKM), a multi-purpose partitional clustering procedure that minimizes the clustering sum of squared error criterion, while imposing a hard sequentiality constraint in the clustering step. The proposed algorithm consists of a novel oversampling technique implemented by removing noisy and weak instances from both majority and minority classes and then oversampling only novel minority instances. We conduct experiments on twelve UCI datasets from various application domains, using five algorithms for comparison and eight evaluation metrics. Experimental results show the effectiveness of the proposed clustering algorithm in clustering balanced and imbalanced data.
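The “uniform effect” described in the abstract can be reproduced on a toy example. The sketch below is not the authors' IKM algorithm; it is plain Lloyd's k-means on hypothetical one-dimensional data (90 majority points and 10 minority points, chosen only for illustration), with the helper names `kmeans_1d` and `sse` invented here. Even when the centroids are initialized at the true class means, minimizing the sum of squared error pulls a large part of the majority class into the minority cluster.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain Lloyd's k-means on 1-D data; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid (ties go to the lower index).
        new_labels = [
            min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
            for x in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared error of a partition around the given centroids."""
    return sum((x - centroids[lab]) ** 2 for x, lab in zip(points, labels))


# Hypothetical imbalanced data: a wide majority class and a small,
# clearly separated minority class.
majority = list(range(0, 90))      # 90 points: 0, 1, ..., 89
minority = list(range(100, 110))   # 10 points: 100, ..., 109
points = majority + minority
true_labels = [0] * len(majority) + [1] * len(minority)

# Best-case initialization: start from the true class means.
true_centroids = [sum(majority) / len(majority), sum(minority) / len(minority)]

labels, centroids = kmeans_1d(points, true_centroids)
sizes = [labels.count(0), labels.count(1)]

print("true sizes:   ", [len(majority), len(minority)])  # [90, 10]
print("k-means sizes:", sizes)                            # [53, 47]
print("SSE, true partition:   ", sse(points, true_labels, true_centroids))
print("SSE, k-means partition:", sse(points, labels, centroids))
```

On this data the k-means partition has a lower sum of squared error than the true 90/10 partition, which is why the criterion alone drifts toward clusters of similar size.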