
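NEMTF: a news web page information extraction method based on multi-dimensional text features
Cite this article: Weng Binyue, Qin Yongbin, Huang Ruizhang, Ren Lina, Tian Yuelin. NEMTF: a news web page information extraction method based on multi-dimensional text features[J]. Application Research of Computers, 2022, 39(4): 1043-1048.
Authors: Weng Binyue  Qin Yongbin  Huang Ruizhang  Ren Lina  Tian Yuelin
Affiliations: College of Computer Science and Technology, Guizhou University, Guiyang 550025; Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025; Guizhou Light Industry Technical College, Guiyang 550025
Funding: Key Program of the Guizhou Provincial Science and Technology Foundation; National Natural Science Foundation of China; Guizhou Provincial Major Science and Technology Special Program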
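Abstract: Mainstream web page extraction methods currently have two major problems: they extract only a single type of information, so they struggle to obtain multiple kinds of news information; and they rely heavily on HTML tags, so they are hard to extend to different sources. To address this, the paper proposes a news web page information extraction method based on multi-dimensional text features. Using the writing characteristics of news text, it defines writing, semantic, and position features, fuses them into a multi-dimensional text feature through a multi-channel convolutional neural network, and uses that feature to extract multiple kinds of news web page information. Training on only a small dataset is enough to extract news web page information from a new source. Experimental results show that the method outperforms the current best method.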

Keywords: web page information extraction  convolutional neural network  Web mining  text feature
Received: 2021/10/12
Revised: 2022/3/14

NEMTF: method of news Web content extraction based on multi-dimensional text features
Weng Binyue, Qin Yongbin, Huang Ruizhang, Ren Lina and Tian Yuelin. NEMTF: method of news Web content extraction based on multi-dimensional text features[J]. Application Research of Computers, 2022, 39(4): 1043-1048.
Authors: Weng Binyue  Qin Yongbin  Huang Ruizhang  Ren Lina and Tian Yuelin
Affiliation: College of Computer Science and Technology, Guizhou University
Abstract: At present, mainstream web page extraction methods have two major problems: a) they extract only a single type of information, which makes it difficult to obtain multiple kinds of news information; b) they rely heavily on HTML tags, which makes them difficult to extend to different sources. Therefore, this paper proposes an information extraction method for news Web pages based on multi-dimensional text features. Drawing on the writing characteristics of news texts, it defines writing, semantic and location features, and it uses a multi-channel convolutional neural network to fuse them into a multi-dimensional text feature for extracting multiple types of news Web page information. Only a small training dataset is required to extract news Web page information from new sources. Experimental results show that this method performs better than the current best method.
Keywords: Web content extraction  convolutional neural network (CNN)  Web mining  text feature
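
The fusion step named in the abstract (separate writing, semantic and location features combined by a multi-channel convolutional neural network) can be illustrated with a small forward-pass sketch. This is not the authors' architecture: the abstract does not give the feature encodings, kernel sizes, label set or training procedure, so the class `MultiChannelFusion`, the label names, the feature dimensions and the random, untrained weights below are all hypothetical and chosen only to show the shape of the computation.

```python
import math
import random


def conv1d_relu(seq, kernel, bias=0.0):
    """Valid 1-D convolution over a sequence of vectors, followed by ReLU.

    seq is a list of T vectors of length D; kernel is a list of K vectors
    of length D. Returns T - K + 1 activations.
    """
    width = len(kernel)
    out = []
    for start in range(len(seq) - width + 1):
        total = bias
        for offset in range(width):
            total += sum(x * w for x, w in zip(seq[start + offset], kernel[offset]))
        out.append(max(0.0, total))
    return out


def channel_features(seq, kernels):
    """One channel: apply every kernel, then max-over-time pooling."""
    return [max(conv1d_relu(seq, kernel)) for kernel in kernels]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MultiChannelFusion:
    """Hypothetical sketch: one convolutional channel per feature type."""

    LABELS = ["title", "body", "date", "other"]  # assumed label set

    def __init__(self, dims, n_kernels=4, width=2, seed=0):
        rng = random.Random(seed)  # untrained, random weights (assumption)
        self.kernels = {
            name: [
                [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
                for _ in range(n_kernels)
            ]
            for name, dim in dims.items()
        }
        fused_size = n_kernels * len(dims)
        self.weights = [
            [rng.uniform(-1, 1) for _ in range(fused_size)] for _ in self.LABELS
        ]

    def fuse(self, inputs):
        """Concatenate the pooled outputs of all channels into one vector."""
        fused = []
        for name in sorted(self.kernels):
            fused.extend(channel_features(inputs[name], self.kernels[name]))
        return fused

    def predict_proba(self, inputs):
        """Linear layer plus softmax over the fused feature vector."""
        fused = self.fuse(inputs)
        scores = [sum(w * x for w, x in zip(row, fused)) for row in self.weights]
        return softmax(scores)


# Made-up feature sequences for a single text block (illustrative values only).
model = MultiChannelFusion({"writing": 3, "semantic": 4, "location": 2})
block = {
    "writing": [[0.1, 0.0, 1.0], [0.3, 1.0, 0.0], [0.2, 0.0, 0.0]],
    "semantic": [[0.5, -0.2, 0.1, 0.0], [0.4, 0.3, -0.1, 0.2], [0.0, 0.1, 0.9, -0.3]],
    "location": [[0.0, 0.1], [0.5, 0.1], [1.0, 0.1]],
}
probs = model.predict_proba(block)
print(dict(zip(MultiChannelFusion.LABELS, probs)))
```

Each channel sees only its own feature type, and the channels meet only at the concatenation in `fuse`, which is one common way to combine heterogeneous features without depending on HTML tag structure. A real system would learn the kernels and the output layer from labeled news pages.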
This article is indexed in Wanfang Data and other databases.