A clustering with slope algorithm based on MapReduce
Wang Yuping
Wu Huiyun
Liang Jingmin
Huang Youfang
2016
Journal:
Journal of Digital Information Management
Year: 2016
Volume: 14
Issue: 3
Abstract:
The clustering with slope (CLOPE) algorithm is widely used to analyze transactional data because of its excellent performance, lower memory cost, and better quality of results compared with other clustering algorithms. However, on large datasets the running time of the CLOPE algorithm can exceed several days, which is unacceptable. Because this time issue is caused by the algorithm's serial running mode, a parallel running mode needs to be introduced to the CLOPE algorithm to improve its efficiency. A CLOPE algorithm based on MapReduce is presented in this paper. The new algorithm was run in parallel on a Hadoop cluster with multiple nodes. The Hadoop platform split the large dataset into multiple small data blocks, and the CLOPE algorithm was run on each block to obtain small clusters. A modified, cluster-oriented CLOPE algorithm then merged these small clusters into the expected number of clusters. Experiments show that the MapReduce-based CLOPE algorithm runs faster and more efficiently than the serial CLOPE algorithm while achieving the same clustering quality. Running time remained constant as data volume grew and was affected only by the size of the Hadoop cluster. Thus, the proposed algorithm solves the time issue in clustering large datasets and can be used to cluster transactional trade data, website logs, and DNS query logs within a limited time, and even high-dimensional transactional data.
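The split-then-merge flow described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the merge procedure or the Hadoop job, so the sketch assumes that the merge step treats each small cluster as one unit and scores it with the standard CLOPE quantity S * N / W^r (S = total item occurrences, N = number of transactions, W = number of distinct items, r = repulsion). A plain loop stands in for the Hadoop map phase, only a single greedy pass is made (no refinement iterations), and the names `Cluster`, `delta_add`, `assign`, and `clope_blocks` are hypothetical.

```python
from collections import Counter


class Cluster:
    """Summary of a CLOPE cluster: item histogram, transaction count, total size."""

    def __init__(self):
        self.occ = Counter()  # item -> number of occurrences in the cluster
        self.n = 0            # N: number of transactions
        self.s = 0            # S: total number of item occurrences

    @property
    def w(self):
        return len(self.occ)  # W: number of distinct items

    def absorb(self, occ, n):
        """Add one transaction (n == 1) or a whole small cluster (n >= 1)."""
        self.occ.update(occ)
        self.n += n
        self.s += sum(occ.values())


def delta_add(cluster, occ, n, r):
    """Change in S * N / W**r if a unit (histogram occ, n transactions) joins.

    Assumes the unit is non-empty, so the merged width is at least 1.
    """
    s_new = cluster.s + sum(occ.values())
    w_new = len(cluster.occ.keys() | occ.keys())
    new = s_new * (cluster.n + n) / w_new ** r
    old = cluster.s * cluster.n / cluster.w ** r if cluster.n else 0.0
    return new - old


def assign(units, r, max_clusters=None):
    """One greedy pass: each unit joins the cluster with the largest gain."""
    clusters = []
    for occ, n in units:
        candidates = list(clusters)
        if max_clusters is None or len(clusters) < max_clusters:
            candidates.append(Cluster())  # option of opening a new cluster
        best = max(candidates, key=lambda c: delta_add(c, occ, n, r))
        if best.n == 0:  # the freshly opened cluster was chosen
            clusters.append(best)
        best.absorb(occ, n)
    return clusters


def clope_blocks(blocks, r=2.0, k=2):
    """Cluster each block independently, then merge into at most k clusters."""
    small = []
    for block in blocks:  # stand-in for the map phase: blocks are independent
        small.extend(assign([(Counter(set(t)), 1) for t in block], r))
    # Assumed merge step: every small cluster is treated as a single unit.
    return assign([(c.occ, c.n) for c in small], r, max_clusters=k)


if __name__ == "__main__":
    blocks = [
        [("a", "b"), ("a", "b", "c"), ("x", "y")],
        [("x", "y", "z"), ("a", "b")],
    ]
    for c in clope_blocks(blocks, r=2.0, k=2):
        print(sorted(c.occ), c.n)
```

Because each block is processed without reference to the others, the per-block calls could in principle be distributed across nodes; only the compact cluster summaries (histogram and transaction count) need to reach the merge step in this sketch.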