
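Extracting Web Entity Activities Based on SVM and Extended Conditional Random Fields
Cite this article:ZHANG Chuan-Yan, HONG Xiao-Guang, PENG Zhao-Hui, LI Qing-Zhong. Extracting Web entity activities based on SVM and extended conditional random fields [J]. Journal of Software (软件学报), 2012, 23(10): 2612-2627.
Authors:ZHANG Chuan-Yan (张传岩)  HONG Xiao-Guang (洪晓光)  PENG Zhao-Hui (彭朝晖)  LI Qing-Zhong (李庆忠)
Affiliation:School of Computer Science and Technology, Shandong University, Ji’nan, Shandong 250101
Funding:National Natural Science Foundation of China (61003051); National Key Technology R&D Program (2009BAH44B02); Natural Science Foundation of Shandong Province (2009ZRB019RW); Shandong Province Key Science and Technology Program (2010GGX10108)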
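Abstract:Building on traditional information extraction, this paper studies Web entity activity extraction. It gives a formal definition of entity activities based on case grammar and proposes an extraction method based on SVM (support vector machine) and extended conditional random fields that can accurately extract entity activity information from the Web. First, to avoid the heavy work of manually annotating training data, it proposes a training data generation algorithm based on heuristic rules, which converts a semantic role labeling training set into a training set suited to Web entity activity extraction; this set is used to train the support vector machine classifier and the extended conditional random fields separately. During extraction, the classifier first selects the sentences that contain entity activities. The extended conditional random fields then model the label frequency features and relation features that traditional conditional random fields cannot exploit, and label the information to be extracted in natural-language sentences, which improves labeling accuracy. Experiments in multiple domains show that the proposed method is well suited to Web entity activity extraction.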

Keywords:information extraction  case grammar  entity activity  support vector machine  extended conditional random fields
Received:2011/8/15
Revised:2012/1/17

Extracting Web Entity Activities Based on SVM and Extended Conditional Random Fields
ZHANG Chuan-Yan, HONG Xiao-Guang, PENG Zhao-Hui and LI Qing-Zhong. Extracting Web Entity Activities Based on SVM and Extended Conditional Random Fields[J]. Journal of Software, 2012, 23(10): 2612-2627.
Authors:ZHANG Chuan-Yan  HONG Xiao-Guang  PENG Zhao-Hui and LI Qing-Zhong
Affiliation:(School of Computer Science and Technology,Shandong University,Ji’nan 250101,China)
Abstract:Building on traditional information extraction methods, this paper defines a formal model of entity activity based on case grammar and presents a method based on support vector machines and extended conditional random fields to extract Web entity activities accurately. First, to train the machine learning models without manually labeled data, the study puts forward a heuristic method that transforms semantic role labeling training data into training data for entity activity extraction. Next, the study trains a support vector machine classifier and extended conditional random fields on this data. Third, the classifier identifies the sentences that contain Web entity activities. The paper also puts forward extended conditional random fields to model label frequency and relationship features, which traditional conditional random fields cannot model, so that the new model labels entity activity information in natural language sentences more accurately. Finally, the experimental results show that the method is effective across multiple domains and can be applied to Web entity activity extraction.
Keywords:information extraction  case grammar  entity activity  support vector machine  extended conditional random fields
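
The abstract states that the extended model uses label frequency features that a conventional conditional random field cannot model, but it does not give the model's form. The sketch below is only an illustration of why such a feature is non-local, not the authors' formulation: a linear-chain score decomposes over positions, whereas a count of how often a label occurs depends on the whole sequence. The label set, the per-token scores, and the penalty weight are all hypothetical, and the best sequence is found by brute-force enumeration rather than by any inference procedure from the paper.

```python
from itertools import product

# Hypothetical label set for a three-token sentence (not from the paper).
LABELS = ("O", "AGENT", "ACT")

# Hypothetical per-token scores, standing in for learned local potentials.
EMISSION = [
    {"O": 0.1, "AGENT": 1.0, "ACT": 0.2},
    {"O": 0.3, "AGENT": 0.2, "ACT": 1.0},
    {"O": 0.8, "AGENT": 0.1, "ACT": 0.9},
]


def local_score(labels):
    """Score that decomposes over positions, as in a conventional chain model."""
    return sum(EMISSION[i][y] for i, y in enumerate(labels))


def frequency_penalty(labels, max_count=1, weight=0.5):
    """Assumed global feature: penalize each ACT label beyond max_count."""
    extra = max(0, labels.count("ACT") - max_count)
    return -weight * extra


def best_sequence(score_fn):
    """Exhaustively search all label sequences (fine for a toy example)."""
    return max(product(LABELS, repeat=len(EMISSION)), key=score_fn)


local_best = best_sequence(local_score)
global_best = best_sequence(lambda ys: local_score(ys) + frequency_penalty(ys))

print(local_best)
print(global_best)
```

With these made-up scores, the purely local model labels two tokens as ACT, while adding the frequency term makes the single-ACT sequence score higher. Exhaustive search is used only because the example is tiny; it does not scale to real sentences.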
This article is indexed in CNKI, Wanfang Data, and other databases.