
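Two data compression methods for recommender systems
Cite this article: LIU Bo, LIU Xiaoguang, WANG Gang, WU Di. Two data compression methods for recommender systems [J]. Computer Engineering & Science, 2016, 38(11): 2183-2190
Authors: LIU Bo, LIU Xiaoguang, WANG Gang, WU Di
Affiliations: 1. College of Computer and Control Engineering, Nankai University; 2. Beijing Bytedance Technology Co., Ltd.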
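Abstract: The servers of Toutiao (今日头条, "Headlines Today") generate a very large volume of training data every day. To make training convenient, these data follow a specific format and have characteristic distributions. Tests with different types of general-purpose compression algorithms (dictionary-based and non-dictionary-based) show that no single algorithm achieves a satisfactory compression ratio while also meeting business requirements such as throughput and CPU usage. Two strategies are proposed for Toutiao's training data: hierarchical cluster compression (literally, segmented clustering compression) and hash recoding compression. Experimental results show that hierarchical cluster compression raises compression speed while better preserving the compression ratio, whereas hash recoding compression trades a small loss in compression speed for a markedly better compression ratio. Hierarchical cluster compression combined with Gzip increases compression speed by more than 300%; hash recoding combined with Snappy reduces the compression ratio by more than 50%. Depending on actual requirements, either strategy is of some value for lowering Toutiao's operating costs, improving business processing efficiency, and providing a better user experience.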

Keywords: hierarchical cluster compression; hash recoding compression; dictionary compression; training data; Gzip; Snappy
Received: 2016-07-11
Revised: 2016-11-25

Two data compression methods for recommender systems
LIU Bo, LIU Xiaoguang, WANG Gang, WU Di. Two data compression methods for recommender systems [J]. Computer Engineering & Science, 2016, 38(11): 2183-2190
Authors: LIU Bo, LIU Xiaoguang, WANG Gang, WU Di
Affiliation: (1. College of Computer and Control Engineering, Nankai University, Tianjin 300350; 2. Bytedance Inc., Beijing 100085, China)
Abstract: The servers of Headlines Today generate an enormous amount of training data every day. These data are formatted for machine learning. We observed that no single general-purpose compression method achieves a good compression ratio while also satisfying business requirements. We present two methods for the training data from Headlines Today's servers: hierarchical cluster compression (HCC) and hash recoding compression (HRC). HCC combined with Gzip can quadruple the compression speed relative to plain Gzip, which indicates that HCC effectively raises compression speed while preserving the compression ratio. HRC combined with Snappy can halve the compression ratio relative to plain Snappy, which shows that HRC reduces the compression ratio at only a small cost in compression speed. Either method is therefore useful for lowering operating costs, improving business processing efficiency, and providing a better user experience.
Keywords:hierarchical cluster compression  Hash recoding compression  dictionary compression  training data  Gzip  Snappy  
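
The abstract compares methods by compression ratio and compression speed. The sketch below shows one way such figures could be measured for Gzip using only the Python standard library. It is an illustration, not the authors' benchmark: the `measure` helper, the synthetic `sample` records, and the definition of ratio as compressed size divided by original size (lower is better) are all assumptions; Snappy is omitted because it is not in the standard library.

```python
import gzip
import time


def measure(data: bytes, level: int = 6):
    """Return (ratio, speed_mb_s) for gzip at the given compression level.

    Assumption: ratio = compressed size / original size, so lower is better.
    """
    start = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - start
    ratio = len(compressed) / len(data)
    # Throughput in MB of input consumed per second of compression time.
    speed = len(data) / (1024 * 1024) / elapsed if elapsed > 0 else float("inf")
    return ratio, speed


# Hypothetical stand-in for formatted training records (not the real data).
sample = b"".join(
    b"user=%d\titem=%d\tlabel=%d\n" % (i % 1000, i % 37, i % 2)
    for i in range(20000)
)

if __name__ == "__main__":
    for level in (1, 6, 9):
        ratio, speed = measure(sample, level)
        print(f"gzip level {level}: ratio={ratio:.3f}, speed={speed:.1f} MB/s")
```

Timing a single call is noisy; repeating each measurement and taking the median would give more stable numbers.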