
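A Microblog Data Collection Scheme Based on Simulated Login
Cite this article: Sun Qingyun, Wang Junfeng, Zhao Zongqu, Gao Mengchao. A Microblog Data Collection Scheme Based on Simulated Login [J]. Computer Technology and Development, 2014(3): 6-10.
Authors: Sun Qingyun  Wang Junfeng  Zhao Zongqu  Gao Mengchao
Affiliations: [1] College of Computer Science, Sichuan University, Chengdu, Sichuan 610065 [2] College of Computer Science, Sichuan University, Chengdu, Sichuan 610065; Key Laboratory of Visual Synthesis Graphics and Image Technology, Chengdu, Sichuan 610064
Funding: National Science and Technology Major Project (2012ZX10004-901001); National Natural Science Foundation of China (11102124)
Abstract: With the arrival of the Web 2.0 era, public opinion information is generated and spread more quickly on microblogs. To analyze microblog public opinion effectively, acquiring microblog data is especially important. Taking Sina Weibo as the research object, this paper proposes a Web crawler collection scheme based on simulated login. The scheme removes the limit on the number of calls that the microblog API imposes on developers, solves the problem that a traditional Web crawler must pass identity verification, and speeds up microblog data collection, so that a massive amount of microblog data can be obtained in a short time. Experiments show that a system developed with this scheme collects microblog information quickly, is more flexible, and can provide a large amount of accurate data to support analysis in a public opinion system.

Keywords: simulated login technology  Web crawler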

A Microblog Data Collection Method Based on Simulated Login Technology
SUN Qing-yun, WANG Jun-feng, ZHAO Zong-qu, GAO Meng-chao. A Microblog Data Collection Method Based on Simulated Login Technology [J]. Computer Technology and Development, 2014(3): 6-10.
Authors:SUN Qing-yun  WANG Jun-feng  ZHAO Zong-qu  GAO Meng-chao
Affiliation: 1. College of Computer Science, Sichuan University, Chengdu 610065, China; 2. National Key Laboratory of Fundamental Science on Synthetic Vision, Chengdu 610064, China
Abstract: With the arrival of the Web 2.0 era, public sentiment information is generated rapidly and disseminated widely on microblogs, so data collection is becoming increasingly important for public sentiment analysis. A Web crawler based on simulated login technology is presented, with the Sina microblog as its target. The crawler avoids the limit on the number of microblog API calls allowed to a developer, and it also solves the authentication problem faced by traditional Web crawlers. Because collection is accelerated, a huge amount of data can be gathered in a short time. Experimental results show that a system built on this method collects microblog information quickly, is more flexible, and can provide accurate data for a public sentiment analysis system.
Keywords: microblog API  simulated login technology  Web crawler
This paper is indexed in 维普 (VIP) and other databases.