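A Method for Automatic Web Review Extraction
Cite this article: LIU Wei, YAN Hua-Liang, XIAO Jian-Guo, ZENG Jian-Xun. A method for automatic Web review extraction[J]. Journal of Software, 2010, 21(12): 3220-3236. DOI: 10.3724/SP.J.1001.2010.03961
Authors: LIU Wei  YAN Hua-Liang  XIAO Jian-Guo  ZENG Jian-Xun
Funding: Supported by the National High-Tech Research and Development Plan of China (863 Program) under Grant No.2008AA01Z421; the China Postdoctoral Science Foundation Funded Project under Grant Nos.20080440256, 200902014
Abstract: Web user reviews are an information source for many important applications, such as the monitoring and analysis of public opinion, so they must be extracted accurately from Web pages. User-generated content is not constrained by page templates, which poses new challenges for Web data extraction. First, the inconsistency of the content written by different users severely weakens the similarity between review records, both in the DOM tree and in visual appearance. Second, the review content corresponds to a complex subtree in the DOM tree, and these subtrees differ greatly from one another in structure. To address these two problems, a complete solution is proposed that combines several techniques to extract user review content. Extraction proceeds in two steps: review records are first extracted from the Web page using a depth-weighted tree similarity algorithm, and the pure user review content is then extracted from each review record by comparing the consistency of nodes in the DOM tree. Experimental results on multiple news Web sites and forum Web sites show that the method achieves high accuracy and efficiency.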

Keywords: Web user review  structured data record  Web data extraction
Received: 2010-09-06
Revised: 2010-11-24

Solution for Automatic Web Review Extraction
LIU Wei, YAN Hua-Liang, XIAO Jian-Guo, and ZENG Jian-Xun. Solution for Automatic Web Review Extraction[J]. Journal of Software, 2010, 21(12): 3220-3236. DOI: 10.3724/SP.J.1001.2010.03961
Authors:LIU Wei  YAN Hua-Liang  XIAO Jian-Guo  ZENG Jian-Xun
Abstract:Web user reviews are an important information source for many popular applications (e.g., monitoring and analysis of public opinion), and they need to be extracted accurately from Web pages. Web user reviews are user-generated content, whose presentation is not restricted by the Web page template, and this raises new challenges. First, the inconsistency of review contents, in both the DOM tree and visual appearance, impairs the similarity between review records; second, the review content in a review record corresponds to a complicated subtree rather than a single node in the DOM tree. To tackle these challenges, a comprehensive solution is proposed that performs automatic extraction of Web reviews by combining several techniques. The review records are first extracted from Web pages using a level-weighted tree similarity algorithm, and the pure review contents are then extracted from the records by comparing node consistency. Experimental results on news Web sites and forum Web sites indicate that the solution achieves high extraction accuracy and efficiency.
Keywords:Web user review   structured data record   Web data extraction
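
The abstract names a level-weighted tree similarity algorithm for the first step but does not give its formula. The following is only a minimal sketch of the general idea, under loud assumptions: DOM nodes are modelled as hypothetical `(tag, children)` tuples, each level is compared by the overlap of its tag multisets, and level `d` is weighted by `decay ** d` so that differences deep inside a record (where free-form user content lives) count less than differences near the record root. None of these choices is taken from the paper.

```python
from collections import Counter

# Assumption: a DOM node is a (tag, [children]) tuple. This is an
# illustrative representation, not the paper's data structure.

def tags_by_level(node, depth=0, levels=None):
    """Collect a Counter of tag names for every depth of the tree."""
    if levels is None:
        levels = {}
    tag, children = node
    levels.setdefault(depth, Counter())[tag] += 1
    for child in children:
        tags_by_level(child, depth + 1, levels)
    return levels

def level_weighted_similarity(a, b, decay=0.5):
    """Return a similarity in [0, 1].

    Assumption: level d contributes with weight decay**d, and the
    per-level score is multiset overlap divided by multiset union.
    """
    la, lb = tags_by_level(a), tags_by_level(b)
    total = score = 0.0
    for depth in range(max(len(la), len(lb))):
        ca, cb = la.get(depth, Counter()), lb.get(depth, Counter())
        overlap = sum((ca & cb).values())  # tags present in both levels
        union = sum((ca | cb).values())    # tags present in either level
        weight = decay ** depth
        total += weight
        score += weight * overlap / union
    return score / total

# Two hypothetical review records: same wrapper, different user content.
record_1 = ("div", [("span", []), ("p", [("b", [])])])
record_2 = ("div", [("span", []), ("p", [("a", []), ("img", [])])])
similarity = level_weighted_similarity(record_1, record_2)
```

In this sketch the two records agree on their top two levels and differ only at the deepest one, so the score stays high even though the user-written parts share no tags. A threshold on such a score could then group sibling subtrees into review records; the second step described in the abstract (comparing node consistency to isolate the pure review content) is not covered here.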
This article is indexed in Wanfang Data and other databases.