Parallel clustering of very large document datasets with MapReduce

Wu Sen; Feng Xiaodong; Yang Jie; Zhang Xiaonan

doi:10.13374/j.issn1001-053x.2014.10.019

Dongling School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China

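Funding:

National Natural Science Foundation of China (71271027); Specialized Research Fund for the Doctoral Program of Higher Education (20120006110037); Fundamental Research Funds for the Central Universities (FRF-TP-10-006B)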

Details

Corresponding author:
Wu Sen, E-mail: wusen@manage.ustb.edu.cn

Chinese Library Classification (CLC) number: TP391
Publication history
- Received: 2013-09-30
- Published online: 2021-07-19

Parallel clustering of very large document datasets with MapReduce

Dongling School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China

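Abstract

Developing fast and effective clustering methods for large-scale text data is a current focus of data mining research and applications. To preserve clustering quality while improving efficiency, this paper proposes a text clustering algorithm based on searching for a "mutually minimum-similarity document pair", together with a distributed parallel computing model. First, a text similarity measure is proposed using the vector space model. Second, the two initial centers of each bisection are selected by searching for a mutually minimum-similarity document pair, which yields a bisecting K-means algorithm that optimizes the cluster centroids with a single partition. Finally, a parallel clustering model for large-scale text in cloud computing applications is designed on the MapReduce framework. Experiments with real text data on the Hadoop platform show that, compared with the original bisecting K-means, the proposed algorithm achieves comparable clustering quality with a clear efficiency advantage, and that the parallel clustering model scales well across different data sizes and numbers of compute nodes.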
- cloud computing
- text
- clustering
- similarity
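The seed-selection idea summarized above can be illustrated with a minimal Python sketch. The abstract does not give the paper's similarity formula or search procedure, so everything here is an assumption: cosine similarity over raw term counts stands in for the proposed measure, the greedy "hop to the least similar document" search is only one plausible reading of the "mutually minimum-similarity document pair", and the names `cosine`, `least_similar`, `mutual_min_pair`, and `bisect_once` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors (assumed measure)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def least_similar(i, vecs):
    """Index of the document least similar to document i (needs >= 2 docs)."""
    others = (j for j in range(len(vecs)) if j != i)
    return min(others, key=lambda j: cosine(vecs[i], vecs[j]))


def mutual_min_pair(vecs, start=0):
    """Hop between least-similar documents until two pick each other.

    The loop is bounded so that similarity ties cannot cause endless cycling.
    """
    i = start
    j = least_similar(i, vecs)
    for _ in range(len(vecs)):
        k = least_similar(j, vecs)
        if k == i:  # i and j are each other's least similar document
            return i, j
        i, j = j, k
    return i, j


def bisect_once(vecs):
    """Split one cluster in a single pass using the pair as the two seeds."""
    a, b = mutual_min_pair(vecs)
    left, right = [], []
    for idx, v in enumerate(vecs):
        closer_to_a = cosine(v, vecs[a]) >= cosine(v, vecs[b])
        (left if closer_to_a else right).append(idx)
    return left, right


docs = [
    "map reduce hadoop cluster",
    "hadoop map reduce job",
    "cat dog pet",
    "dog pet food",
]
vecs = [Counter(d.split()) for d in docs]
print(mutual_min_pair(vecs))  # (0, 2)
print(bisect_once(vecs))      # ([0, 1], [2, 3])
```

In this toy input the two seeds share no terms, so a single assignment pass already separates the two topics. A full bisecting procedure would repeat the split on a chosen cluster until the desired number of clusters is reached.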
Abstract: Developing fast and efficient methods for clustering massive document data is one of the active topics in current data mining research and applications. To preserve clustering quality while improving clustering efficiency, a document clustering algorithm is proposed that searches for a pair of documents with mutually minimum similarity, and a distributed parallel computing model is provided for it. First, a document similarity measure is presented using the vector space model (VSM). Then, a bisecting clustering algorithm is proposed that combines bisecting K-means with the proposed initial cluster center selection approach to find optimized cluster centroids with a single partition. Finally, a distributed parallel document clustering model is designed for cloud computing on the MapReduce framework. Experiments on the Hadoop platform with real document datasets show that the proposed algorithm is clearly more efficient than the original bisecting K-means while giving an equivalent clustering result. The scalability of the parallel clustering model was also evaluated with different data sizes and different numbers of computation nodes.
- cloud computing
- documents
- clustering
- similarity
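To make the MapReduce formulation mentioned in the abstract more concrete, the sketch below simulates one assignment and centroid-update step in plain Python. It is not the paper's Hadoop implementation, whose details the abstract does not give. The map/shuffle/reduce split, the use of cosine similarity over term counts, and the names `map_assign`, `reduce_centroid`, and `run_step` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity of two sparse term-weight vectors (assumed measure)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_assign(doc_vec, centroids):
    """Map step: emit (index of the most similar centroid, document vector)."""
    best = max(range(len(centroids)),
               key=lambda c: cosine(doc_vec, centroids[c]))
    return best, doc_vec


def reduce_centroid(vectors):
    """Reduce step: average the term weights of one cluster's vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # adds the weights term by term
    return {term: weight / len(vectors) for term, weight in total.items()}


def run_step(doc_vecs, centroids):
    """Simulate map, shuffle (group by key), and reduce on one machine."""
    groups = defaultdict(list)
    for v in doc_vecs:
        key, value = map_assign(v, centroids)
        groups[key].append(value)
    return {key: reduce_centroid(vs) for key, vs in groups.items()}


docs = [
    "map reduce hadoop cluster",
    "hadoop map reduce job",
    "cat dog pet",
    "dog pet food",
]
vecs = [Counter(d.split()) for d in docs]
new_centroids = run_step(vecs, [vecs[0], vecs[2]])
print(sorted(new_centroids))  # [0, 1]
```

Because each map call depends only on one document and the current centroids, the assignment work can be distributed across nodes, and only the per-cluster groups need to be brought together for the reduce step. On a real cluster the grouping would be done by the framework's shuffle phase instead of the in-memory dictionary used here.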