
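Large-scale empirical study on machine learning related questions on Stack Overflow
Cite this article: Zhi-yuan WAN, Jia-heng TAO, Jia-kun LIANG, Zhen-gong CAI, Cheng CHANG, Lin QIAO, Qiao-ni ZHOU. Large-scale empirical study on machine learning related questions on Stack Overflow[J]. Journal of Zhejiang University (Engineering Science), 2019, 53(5): 819-828. DOI: 10.3785/j.issn.1008-973X.2019.05.001
Authors: Zhi-yuan WAN  Jia-heng TAO  Jia-kun LIANG  Zhen-gong CAI  Cheng CHANG  Lin QIAO  Qiao-ni ZHOU
Affiliations: 1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, Zhejiang, China; 2. School of Software Technology, Zhejiang University, Ningbo 315048, Zhejiang, China; 3. Information and Communication Branch, State Grid Liaoning Electric Power Co., Ltd., Shenyang 110006, Liaoning, China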
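Abstract: To investigate the distribution and development trends of machine learning topics, tag filtering was used to extract 60 028 machine learning related question posts from more than 41.78 million posts on the online Q&A website Stack Overflow. Analysis of these question posts, counting the discussion volume of each machine learning platform, showed that Scikit-learn, TensorFlow, and Keras are the three most frequently discussed platforms, together accounting for 58% of the total discussion volume. To further analyze the discussion topics, latent Dirichlet allocation (LDA) topic models were trained. A progressive search method for the number of topics in adaptive LDA was proposed; it evaluates model output with the topic coherence coefficient to obtain the optimal number of topics. This yielded nine discussion topics in three categories: code-related, model-related, and theory-related. The popularity and answering difficulty of the topics were analyzed based on the view counts and comment counts of the question posts in each topic.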

Keywords: empirical research  machine learning  Stack Overflow  latent Dirichlet allocation (LDA)  topic coherence
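
The tag-filtering and platform-counting steps described in the abstract can be illustrated with a minimal Python sketch. The tag sets, the post dictionary layout (`is_question`, `tags`), and both function names are assumptions made for illustration; the abstract does not list the actual filter tags or data schema used in the study.

```python
from collections import Counter

# Hypothetical tag sets: the study's real filter tags are not given in the abstract.
ML_TAGS = {"machine-learning", "scikit-learn", "tensorflow", "keras"}
PLATFORM_TAGS = {"scikit-learn", "tensorflow", "keras"}


def filter_ml_questions(posts):
    """Keep question posts that carry at least one machine learning tag."""
    return [
        p for p in posts
        if p["is_question"] and ML_TAGS & set(p["tags"])
    ]


def platform_shares(questions):
    """Return each platform's share of all platform tag occurrences."""
    counts = Counter(
        tag for q in questions for tag in q["tags"] if tag in PLATFORM_TAGS
    )
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


if __name__ == "__main__":
    # Toy posts, invented for this example only.
    sample = [
        {"is_question": True, "tags": ["python", "keras"]},
        {"is_question": True, "tags": ["java"]},
        {"is_question": False, "tags": ["tensorflow"]},
        {"is_question": True, "tags": ["tensorflow", "keras"]},
    ]
    questions = filter_ml_questions(sample)
    print(len(questions), platform_shares(questions))
```

Counting tag occurrences rather than posts is one possible definition of "discussion volume"; a post tagged with two platforms contributes to both under this choice.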

Large-scale empirical study on machine learning related questions on Stack Overflow
Zhi-yuan WAN, Jia-heng TAO, Jia-kun LIANG, Zhen-gong CAI, Cheng CHANG, Lin QIAO, Qiao-ni ZHOU. Large-scale empirical study on machine learning related questions on Stack Overflow[J]. Journal of Zhejiang University (Engineering Science), 2019, 53(5): 819-828. DOI: 10.3785/j.issn.1008-973X.2019.05.001
Authors: Zhi-yuan WAN  Jia-heng TAO  Jia-kun LIANG  Zhen-gong CAI  Cheng CHANG  Lin QIAO  Qiao-ni ZHOU
Abstract: To investigate the distribution and trends of machine learning topics, 60 028 machine learning related questions were extracted, by filtering on tags, from more than 41.78 million posts on the online Q&A website Stack Overflow. The extracted question posts were analyzed by counting the discussion volume of each machine learning platform. The three most frequently discussed platforms were Scikit-learn, TensorFlow, and Keras, which together accounted for 58% of the discussion. Latent Dirichlet allocation (LDA) topic models were then trained to explore the discussion topics further. A progressive search approach was proposed for the number of topics in adaptive LDA; it uses the topic coherence coefficient to evaluate model output and select the optimal number of topics. Nine discussion topics were discovered, falling into three broad categories: code-related, model-related, and theory-related. In addition, the popularity and difficulty of the topics were analyzed according to the view counts and comment counts of their question posts.
Keywords: empirical research  machine learning  Stack Overflow  latent Dirichlet allocation (LDA)  topic coherence
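
The abstract names a progressive search for the number of LDA topics guided by topic coherence, but does not spell out the procedure. The sketch below shows one plausible coarse-to-fine scheme and should not be read as the authors' algorithm: the function `progressive_topic_search`, its parameters (`k_min`, `k_max`, `step`), and the narrowing rule are all assumptions. The `coherence` argument is any callable that maps a topic count to a score, so the search logic stays independent of the LDA library.

```python
def progressive_topic_search(coherence, k_min=2, k_max=50, step=8):
    """Coarse-to-fine search for the topic count with the highest coherence.

    `coherence` maps a topic count k to a score (higher is better), for
    example by training an LDA model with k topics and scoring its topics.
    The search scheme itself is an illustrative assumption.
    """
    cache = {}

    def score(k):
        # Each topic count is evaluated at most once, since training is costly.
        if k not in cache:
            cache[k] = coherence(k)
        return cache[k]

    lo, hi = k_min, k_max
    while True:
        candidates = list(range(lo, hi + 1, step))
        if candidates[-1] != hi:
            candidates.append(hi)
        best = max(candidates, key=score)
        if step == 1:
            return best, score(best)
        # Narrow the window to one coarse step on each side of the best
        # candidate, then halve the step for the next pass.
        lo = max(k_min, best - step)
        hi = min(k_max, best + step)
        step = max(1, step // 2)


if __name__ == "__main__":
    # Toy stand-in for a real coherence score, peaking at k = 9.
    best_k, best_score = progressive_topic_search(lambda k: -(k - 9) ** 2)
    print(best_k, best_score)
```

In practice the callable would wrap model training and scoring; gensim's `LdaModel` and `CoherenceModel` are one option, though the abstract does not say which toolkit the study used. A coarse-to-fine search assumes the coherence curve is reasonably smooth, so it can miss a narrow peak that lies between coarse candidates.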
This article is indexed in CNKI and other databases.