Analysis of large-scale distributed machine learning systems: a case study on LDA

Citation: TANG Lizhe, FENG Dawei, LI Dongsheng, LI Rongchun, LIU Feng. Analysis of large-scale distributed machine learning systems: a case study on LDA[J]. Journal of Computer Applications, 2017, 37(3): 628-634.
Authors: TANG Lizhe  FENG Dawei  LI Dongsheng  LI Rongchun  LIU Feng
Affiliation: 1. National Laboratory for Parallel and Distributed Processing (National University of Defense Technology), Changsha Hunan 410073, China; 2. College of Computer, National University of Defense Technology, Changsha Hunan 410073, China
Funding: National Natural Science Foundation of China (61222205)
Abstract: To address the problems of scalability, algorithm convergence performance and operational efficiency faced when building large-scale machine learning systems, the challenges that large-scale samples, models and network communication pose to machine learning systems were analyzed, along with the solutions adopted by existing systems. Taking the Latent Dirichlet Allocation (LDA) model as an example, three open-source distributed LDA systems (Spark LDA, PLDA+ and LightLDA) were compared in terms of system resource consumption, algorithm convergence performance and scalability, and the differences among them in design, implementation and performance were analyzed. The experimental results show that for small sample sets and models, the memory usage of LightLDA and PLDA+ is about half that of Spark LDA, and their convergence speed is 4 to 5 times that of Spark LDA; for larger sample sets and models, the total network communication volume and the convergence time of LightLDA are far smaller than those of PLDA+ and Spark LDA, showing good scalability. A "data parallelism + model parallelism" architecture can effectively cope with the challenges of large-scale samples and models; the Stale Synchronous Parallel (SSP) scheme for parameter synchronization, local caching of the model and sparse storage of parameters can effectively reduce network overhead and improve system operation efficiency.
Keywords: Latent Dirichlet Allocation (LDA)  topic model  text clustering  Gibbs sampling  variational Bayes inference  machine learning
Received: 2016-09-21
Revised: 2016-09-30
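For context on the "Gibbs sampling" keyword: distributed LDA systems of this kind are typically built around the collapsed Gibbs sampler. As a reference point (the formula below is the standard textbook update for symmetric priors, not reproduced from the paper), the topic assignment z_i of token i in document d is resampled from

    P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n_{k,-i}^{(w_i)} + \beta}{n_{k,-i}^{(\cdot)} + V\beta} \left( n_{d,-i}^{(k)} + \alpha \right)

where n_{k,-i}^{(w_i)} counts how often word w_i is assigned to topic k, n_{k,-i}^{(\cdot)} is the total count for topic k, n_{d,-i}^{(k)} counts topic k in document d (all counts excluding token i), V is the vocabulary size, and \alpha, \beta are the Dirichlet hyperparameters. The word-topic counts form exactly the large, mostly sparse matrix that such systems cache locally and store sparsely.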

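The Stale Synchronous Parallel (SSP) scheme credited in the abstract bounds how far workers may drift apart: a fast worker blocks once it is more than a fixed staleness bound ahead of the slowest worker, so workers can keep sampling on locally cached parameters between synchronizations instead of waiting at a barrier every iteration. Below is a minimal Python sketch of the clock logic only; the class and method names are illustrative, not taken from any of the three systems compared.

    import threading

    class SSPClock:
        """Bounded-staleness coordination for num_workers workers (illustrative sketch)."""

        def __init__(self, num_workers, staleness):
            self.clocks = [0] * num_workers  # per-worker iteration counters
            self.staleness = staleness       # maximum allowed clock gap
            self.cond = threading.Condition()

        def tick(self, worker_id):
            """Advance this worker's clock, then block while it is more
            than `staleness` iterations ahead of the slowest worker."""
            with self.cond:
                self.clocks[worker_id] += 1
                self.cond.notify_all()  # a lagging worker's tick may unblock others
                while self.clocks[worker_id] - min(self.clocks) > self.staleness:
                    self.cond.wait()

With staleness = 0 this reduces to a bulk-synchronous barrier per iteration; larger values trade parameter freshness for less waiting time.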
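The abstract also credits local model caching and sparse parameter storage for the reduced network overhead. The sketch below illustrates both ideas under stated assumptions: fetch_row is a hypothetical stand-in for a parameter-server read, and word-topic rows are stored as dictionaries so memory scales with the number of nonzero counts rather than with the full dense V-by-K matrix.

    from collections import defaultdict

    class CachedSparseModel:
        """Worker-side sparse cache of word-topic counts (illustrative sketch)."""

        def __init__(self, fetch_row):
            # fetch_row(word_id) -> {topic_id: count}; hypothetical callback
            # standing in for a parameter-server RPC.
            self.fetch_row = fetch_row
            self.cache = {}                   # word_id -> {topic_id: count}
            self.pending = defaultdict(int)   # (word_id, topic_id) -> buffered delta

        def row(self, word_id):
            # Serve reads from the local cache; fetch from the server on first miss.
            if word_id not in self.cache:
                self.cache[word_id] = dict(self.fetch_row(word_id))
            return self.cache[word_id]

        def update(self, word_id, topic_id, delta):
            # Apply the update locally and buffer it for a batched push,
            # instead of sending one message per sampled token.
            row = self.row(word_id)
            row[topic_id] = row.get(topic_id, 0) + delta
            if row[topic_id] == 0:
                del row[topic_id]  # keep rows strictly sparse
            self.pending[(word_id, topic_id)] += delta

A worker would read model rows through row() inside the Gibbs update shown earlier and periodically flush the pending buffer to the server in one batched message, which is the kind of traffic reduction the abstract attributes to caching and sparsity.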